fititnt / spatial-data-maching

https://sdm.etica.ai/v/
GNU Affero General Public License v3.0

Handle use cases where the number of loaded files is very high #1

Closed fititnt closed 1 month ago

fititnt commented 1 month ago

While testing loading all files from the AllThePlaces dump (this dataset is used by https://codeberg.org/matkoniecz/improving_openstreetmap_using_alltheplaces_dataset/ and mentioned on https://www.openstreetmap.org/user/fititnt/diary/404303#comment57942), available here https://www.alltheplaces.xyz/ (exact dump link https://alltheplaces-data.openaddresses.io/runs/2024-08-03-13-32-47/output.zip, description "(267.1M) 2,967,929 rows from 2625 spiders, updated Mon Aug 5 03:16:45 AM UTC 2024"), I noticed that the loading process:

  1. does not peak CPU usage (at least with big files, it should use at least one full core for several seconds)
  2. the interface does list the datasets (however, several of them appear empty, including ones that do have data)
  3. the console raises several errors (not for the first files loaded, but after some point)
  4. the total number of items loaded into memory for usage is far, far less than 2M.

I think the loading process needs to be reworked, even if that makes it a bit slower in all cases. Currently, it tries to process all files asynchronously, and this test case has a lot of files: 2623 files (474 of them empty GeoJSON).
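One way to rework this (a minimal sketch, not the project's actual code; `loadWithLimit` and its parameters are hypothetical names) is to cap how many files are read and parsed at once instead of firing all ~2600 reads at the same time:

```javascript
// Sketch: run the per-file loader over all files, but with at most `limit`
// loads in flight at any moment. Results keep the input order.
async function loadWithLimit(files, loadOneFile, limit = 8) {
  const results = new Array(files.length);
  let next = 0;
  async function runWorker() {
    // Each worker pulls the next unclaimed index from the shared queue.
    while (next < files.length) {
      const i = next++;
      results[i] = await loadOneFile(files[i]);
    }
  }
  // Start `limit` workers and wait until the queue is drained.
  const workers = Array.from(
    { length: Math.min(limit, files.length) },
    runWorker
  );
  await Promise.all(workers);
  return results;
}
```

This trades a little throughput for predictable memory and avoids the browser having thousands of pending file reads at once.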

fititnt commented 1 month ago

Hmm. Turns out that the interface is loading all the data (including in v0.5, https://sdm.etica.ai/v/0.5/) even without changes. However, the count preview is written into the tabs before the computation is finished. So there is still a minor problem.

One way to inspect the underlying data is (after waiting for the loading to finish) to use the console to print SDMData (or WorkingData in v0.5.0).

fititnt commented 1 month ago

I released the v0.5.3.3 beta (also a snapshot at https://sdm.etica.ai/v/0.5.3.3/). Turns out that both v0.5.3.2 and the v0.5.0.0 documented in the diary were loading the data into memory (which could be verified manually by dumping the variable SDMData or WorkingData). However, the part of the code that updates the statistics in the UI was doing it too early. This problem was less obvious with the test data used in the diary, but with >2000 files from AllThePlaces it became very obvious.
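The shape of the fix can be sketched like this (assumed names: `loadFile` and `updateTabCount` are illustrative, not the actual functions in the codebase): the per-dataset counts are written to the tabs only after every file promise has resolved, instead of as soon as loading starts.

```javascript
// Sketch: defer the UI statistics until all datasets are fully in memory.
async function loadAll(files, loadFile, updateTabCount) {
  // Wait for every file to finish loading and preprocessing.
  const datasets = await Promise.all(files.map(loadFile));
  // Only now are the counts stable enough to show in the interface.
  for (const ds of datasets) {
    updateTabCount(ds.name, ds.items.length);
  }
  return datasets;
}
```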

Also, attached two images.

Screenshot from 2024-08-06 19-23-53


Screenshot from 2024-08-06 19-28-14

Text description from the images:

  1. 4 files from the current release of AllThePlaces in GeoJSON format were discarded (edward_jones.geojson, fedex.geojson, uhaul.geojson, sparkasse_de.geojson). The reason was malformed JSON. To be fair, I don't plan any workaround for invalid input data (it would overcomplicate the logic); better to expect the user to make sure GeoJSON files are valid GeoJSON.
  2. Trivia: with GeoJSON Text Sequence (the line-by-line format), very likely only the offending line would be discarded (not the entire file).
  3. The memory usage (while having fewer millions of items than the previous test data) was initially around 2 GB. Makes sense, since each item has much more metadata.
  4. The time between selecting all 2623 files and finishing loading and preprocessing every item was 1min 0s (redoing the initial test datasets used in the diary, it was around 20s).
  5. After all the data is ready in memory, the filters take more milliseconds than with the previous datasets. Not as bad as 1 s, but not as good as the 100 ms target maximum.
  6. Exporting the data will not work without rewriting the save-as-file code to create the blob in chunks/streams (I don't have the link now). However, in the meantime, as long as the user exports smaller files at a time, it could work. (Or I could make it split the output and force the browser to save multiple parts.)
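Point 2 above (GeoJSON Text Sequence degrading more gracefully) can be illustrated with a small sketch, not taken from the project's code: when each feature is its own line, a malformed line can be skipped while the rest of the file is kept, whereas one syntax error in a regular GeoJSON file forces discarding the whole file.

```javascript
// Sketch: parse a GeoJSON Text Sequence, dropping only the bad lines.
// RFC 8142 prefixes each record with an RS (0x1E) control character,
// which is stripped here before parsing.
function parseGeoJsonSequence(text) {
  const features = [];
  const errors = [];
  for (const [i, rawLine] of text.split("\n").entries()) {
    const line = rawLine.replace(/^\x1e/, "").trim();
    if (line === "") continue; // ignore blank lines
    try {
      features.push(JSON.parse(line));
    } catch (e) {
      // Discard only this record, not the entire file.
      errors.push({ line: i + 1, message: e.message });
    }
  }
  return { features, errors };
}
```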

Other comments:

  1. I think any more complex logic in the live filter (like trying to do data matching/conflation) will surely take more time than simpler filters (like showing only addr:country=BR). If such uses become frequent, it may be a good idea to spend more time preprocessing the data as it is loaded into memory.
  2. The initial memory usage is somewhat close to the size of the uncompressed files on disk. However, something I didn't do at import time was converting fields that could be numeric into actual numbers, and maybe other strategies to "compact" the in-memory representation of each element.
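A compaction pass of that kind could look like the sketch below (illustrative only; the function name and policy are assumptions, not the project's API). One caveat worth noting for real-world data: fields like postal codes would lose leading zeros if converted blindly, so an actual implementation would probably need a whitelist or blacklist of keys.

```javascript
// Sketch: at import time, convert property values that are purely numeric
// strings into numbers, so items cost less memory and numeric filters can
// compare without re-parsing.
function compactProperties(props) {
  const out = {};
  for (const [key, value] of Object.entries(props)) {
    if (
      typeof value === "string" &&
      value.trim() !== "" &&
      !Number.isNaN(Number(value))
    ) {
      out[key] = Number(value); // e.g. "12345" -> 12345
    } else {
      out[key] = value; // leave everything else untouched
    }
  }
  return out;
}
```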