dwyl / phoenix-uk-postcode-finder-example

๐Ÿ“An example/tutorial application showing how to rapidly find your nearest X by typing your postcode.
GNU General Public License v2.0

ets cache size issue #2

Closed RobStallion closed 5 years ago

RobStallion commented 5 years ago

I am currently having issues trying to cache the ukpostcodes.csv. If I cache the entire csv file, understandably, the cache is pretty massive.

[screenshot: cache memory usage]

a little over 632MB to be exact.

So what I tried instead was to first filter the csv's list of postcodes so that it only contains the postcodes that are in our database (this is just a first step to try to minimise the cache size, not an end goal). However, even with this filtering in place, the cache size remained exactly the same.
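For anyone following along, the filtering step looked roughly like this (an illustrative sketch, not the exact repo code; it assumes the csv columns are id,postcode,latitude,longitude and that `db_postcodes` is built from the postcodes already in our database):

```elixir
# Illustrative sketch of the filtering step (not the exact repo code).
db_postcodes = MapSet.new(["AB1 0AA", "AB1 0AB"])   # in practice: loaded from the venues table

filtered =
  "ukpostcodes.csv"
  |> File.stream!()
  |> Stream.drop(1)                                   # skip the csv header
  |> Stream.map(&String.split(String.trim(&1), ","))  # [id, postcode, latitude, longitude]
  |> Enum.filter(fn [_id, postcode | _] -> MapSet.member?(db_postcodes, postcode) end)

length(filtered)
# => only the postcodes that exist in our db
```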

I thought that maybe it was because it was cached, so I killed the process and restarted it. However it was still the same size.

Next, I checked the ets storage to see if it contained postcodes that it shouldn't (postcodes that are not in our db) and it didn't (but it does contain the postcodes from the db). This left me even more confused.

In order to make sure that this was not some issue with a certain amount of memory being allocated as soon as ets hits a certain size, I deleted about half of the records from the original csv and ran the caching process. This worked as I would have expected and the cache size was around 300MB.

I can confirm that the filtered list of elements from the csv is now 179841 long (about 10 times smaller than the original) and I can confirm that these are the only postcodes that are being entered into the cache. However, the cache size still remains at about 632MB.
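These are roughly the checks I mean (the table name and example postcode are illustrative):

```elixir
# Rough checks on the ets table while debugging (table name is illustrative).
:ets.info(:postcode_cache, :size)
# => number of rows actually stored in the table

:ets.info(:postcode_cache, :memory) * :erlang.system_info(:wordsize)
# => approximate table size in bytes (ets reports memory in machine words)

:ets.lookup(:postcode_cache, "ZZ99 9ZZ")
# => [] for a postcode that is not in our db, i.e. it was correctly filtered out
```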

RobStallion commented 5 years ago

It looks like the GenServer is what is taking up all the space, not the ets cache. This may explain why the postcode cache size only changes when the file itself changes.
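For reference, this is roughly how to see where the memory actually sits (the process/table names are illustrative):

```elixir
# Compare the GenServer process's memory with the ets table's memory.
# `PostcodeCache` / `:postcode_cache` are illustrative names.
pid = Process.whereis(PostcodeCache)

Process.info(pid, :memory)
# => {:memory, bytes} held by the GenServer process itself (its heap/state)

:ets.info(:postcode_cache, :memory) * :erlang.system_info(:wordsize)
# => bytes held by the ets table
```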

Looking into this now.

RobStallion commented 5 years ago

Okay so I think that I have resolved this issue now.

The initial problem was that the GenServer was doing a File.read of the ukpostcodes.csv. This essentially kept a second full copy of the csv file in memory.

The next issue was mapping over this data. Mapping over the data built new collections that were not actually needed, but because map returns its results they were still being held in memory. I have changed the map to an each function (which is run only for its side effects) and that has also massively lowered the amount of memory being used.
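Roughly, the fixed version looks something like this (a hand-wavy reconstruction with illustrative names, not the exact diff):

```elixir
# Hand-wavy reconstruction of the fix (names are illustrative). The key changes:
# stream the file instead of File.read-ing the whole thing, and use Enum.each
# (side effects only) instead of Enum.map, so nothing is accumulated or kept
# in the GenServer state.
defmodule PostcodeCache do
  use GenServer

  def start_link(_opts), do: GenServer.start_link(__MODULE__, :ok, name: __MODULE__)

  def init(:ok) do
    :ets.new(:postcode_cache, [:named_table, :set, :public])

    "ukpostcodes.csv"
    |> File.stream!()      # lazy, line by line
    |> Stream.drop(1)      # skip the csv header
    |> Enum.each(fn line ->
      [_id, postcode, lat, long] = String.split(String.trim(line), ",")
      :ets.insert(:postcode_cache, {postcode, lat, long})
    end)

    # return a tiny state instead of holding the file/rows in the GenServer
    {:ok, :ok}
  end
end
```

The important part is that nothing derived from the file ends up in the GenServer state, so the only long-lived copy of the data is the ets table itself.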

Looks like we can store all the postcodes in ets now with nowhere near as much overhead. The process is now only taking up about 2MB (a massive improvement over 632MB 😑).

[screenshot: process memory usage, ~2MB]

Below are screenshots of the memory usage.

1.8 million postcodes

[screenshot: memory usage with 1.8 million postcodes]

When only storing postcodes similar to the venue postcodes (about 200k):

[screenshot: memory usage with ~200k postcodes]

RobStallion commented 5 years ago

Just tried this with a copy of the production db to be sure:

[screenshot: memory usage with the production db]

Looks like this should solve the issue.

nelsonic commented 5 years ago

@RobStallion nice detective/debugging work! thanks for sharing. 👍 (consider adding even more detail to your comments so people can learn more...) 😉

nelsonic commented 5 years ago

@RobStallion was the learning on this written up somewhere that others can learn from and debug if/when it happens again? I ask because of: https://github.com/club-soda/club-soda-guide/issues/498

RobStallion commented 5 years ago

@nelsonic Other than in the issues it hasn't been. I have been looking into ways of querying an ets table (#3) and I am about to try a new approach for the caching system.

My thoughts are that if we cache the store/venue info then we may be able to reduce the size of the cache considerably as we will not be storing anywhere near as many rows as we are currently.
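As a rough illustration of what I mean (the module/field names are made up, and a :bag table would allow several venues per postcode):

```elixir
# Rough illustration only: cache venue rows keyed by postcode instead of
# caching every UK postcode. `Repo`/`Venue` stand in for the real schema.
:ets.new(:venue_cache, [:named_table, :bag, :public])

Enum.each(Repo.all(Venue), fn venue ->
  :ets.insert(:venue_cache, {venue.postcode, venue.name, venue.lat, venue.long})
end)
```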

I still need to test this to make sure that is the case: it could, for example, end up using far fewer rows but, because each row contains more info, still take up more space in memory. It may also have a negative effect on search times. I am going to update the issues as I go with the results of my testing (I am now at a point where I know how to use complex queries on an ets table, see the last comment on #3).
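To give an idea of what those queries look like (assuming rows are stored as `{postcode, lat, long}` with lat/long as floats; see #3 for the real experiments):

```elixir
# Illustrative match-spec query on an ets table.
:ets.select(:postcode_cache, [
  {
    {:"$1", :"$2", :"$3"},                              # {postcode, lat, long}
    [{:andalso, {:>, :"$2", 51.0}, {:<, :"$2", 52.0}}], # guard: latitude between 51 and 52
    [:"$_"]                                             # return the whole matching row
  }
])
```

Match specs run inside the ets layer, so the filtering happens without first copying every row out to the calling process.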

Also, this may not work the same way with CS, as a lot of other data relating to venues (e.g. venue_types, venue_images) is often loaded when a venue is loaded.

In CS we are currently using the postcode cache as a way of validating the postcodes that users search for. If the postcode is in the cache we know that it is valid and we do not need to send an api request to a third party for validation. If the postcode is not in the cache, we get it validated and add it to the cache so that subsequent searches use the cached postcode. If the postcodes are the cause of the memory issue, then perhaps we decrease the number of postcodes we store in the cache on startup and (as you mentioned) periodically remove older postcodes that have not been searched?
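Something like this sketch is the lookup-or-validate flow I mean (`PostcodeApi.validate/1` is a made-up stand-in for whichever third-party validation client CS actually uses):

```elixir
defmodule PostcodeCheck do
  # Sketch of the lookup-or-validate flow described above.
  # `PostcodeApi.validate/1` is a hypothetical third-party client.
  def valid_postcode?(postcode) do
    case :ets.lookup(:postcode_cache, postcode) do
      [_cached] ->
        # already validated on a previous search, no api call needed
        true

      [] ->
        case PostcodeApi.validate(postcode) do
          {:ok, lat, long} ->
            # cache it so subsequent searches skip the api call
            :ets.insert(:postcode_cache, {postcode, lat, long})
            true

          {:error, _reason} ->
            false
        end
    end
  end
end
```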

To answer your question about documenting this (sorry I went off on a tangent here), if I can get the ets table working as a viable alternative to postgres, then I plan on incorporating it into this readme.

If not, then I will create a separate tutorial on how to use ets in learn-elixir (I think it will probably be a good idea to add a section on ets to learn-elixir no matter how this readme goes: https://github.com/dwyl/learn-elixir/issues/103).