InseeFrLab / pynsee

pynsee package contains tools to easily search and download french data from INSEE and IGN APIs
https://pynsee.readthedocs.io/en/latest/
MIT License
67 stars 8 forks source link

Bugfix/clean tmpfiles #206

Closed tgrandje closed 4 days ago

tgrandje commented 1 month ago

Should handle that issue https://github.com/InseeFrLab/pynsee/issues/205

tfardet commented 1 month ago

Thanks for that, should we add ignore_errors=True to rmtree?

tgrandje commented 1 month ago

Ok, so that should be good to go once the last tests are completed. @hadrilec @tfardet I'll let you review the PR

tfardet commented 1 month ago

Reading the PR in more details, I'm actually not sure I understand what should be going on here.

For instance, you're clearing the tmpdir generated by a function relying on lru_cache for the value it returns... a value which relies on the tmpdir!

Basically there are three possibilities here:

  1. the file is actually supposed to be stored for a long time and we move it to the cache folder (unlikely but @hadrilec will know)
  2. the file is supposed to last the whole python session, in which case we actually need to keep a list of these tmpfiles/dirs and clear them all up in an atexit function
  3. the file is really supposed to be temporary (how temporary, then?) and we should remove the lru_cache from get_capabilities
tgrandje commented 1 month ago

Yes you're right. In fact, I first went a bit agressively on the cleaning after it froze my WSL partition and had to went back after seeing the lru_cache trick. I might have missed some checks, and I mostly relied on the automated tests to find what went well and what went wrong. That may entirely be the case on the nested lru_cache you've found (the nesting being the reason why emptying the file from the tempdir in the inner function does not trigger an exception).

As far as I can see, the tempdir creation is not consistent over the whole package. I suspect the lru_cached function creating the tempdir was created later on pynsee's creation to avoid continuous hardrive writing. There are still some other non-lru-cached tempdir created though.

I'm not sure many tempdirs should be conserved longer than short-term. To be honest, some tempdirs are not even not necessary as you could perfectly unzip a file without writing to the hardrive, using an io object (with the eventuality of this beeing not right with nested zips...). But that should be something for a future PR...

tgrandje commented 1 month ago

I think I'm good.

As I've said, I removed what seemed to me unnecessary calls to tempfiles. I've kept it on:

I hope there's not too much mistakes on that PR. If there are (which is probably the case), I will be absent for 2 weeks, so feel free to correct it by yourself if needed.

EDIT I 've also lightly altered the get_temp_dir to simplify the code. The path to the insee folder should not have changed (if that's the case, then it is a mistake).

tgrandje commented 1 month ago

I've also had some thoughts for future PRs:

If that's ok with you both @hadrilec @tfardet , fill free to open issues on this without me.

hadrilec commented 1 month ago

Nominatim is called from _get_location_openstreetmap which is, in my opinion, compliant with the guidelines

tfardet commented 1 month ago

wherever there is a call to _create_insee_folder, it might be worth checking if functions can be rewritten to use the @save_df decorator instead (it could simplify maintainance on the long run)

I plan on restarting the config PR soon, breaking it into smaller bits this time, this is definitely on my todo list and should indeed go away

maybe we should use black once and for all to cleanup the whole code. There is a lot of "garbage" on my PR due to black cleaning up the code on each modified file (sorry about that 😳) ;

good point, we could also add this as a pre-commit hook

I've added a "TODO" comment on download/_download_store_file. This function is not using the update argument and at first reading I'd say this is not necessary at all, the function being called by download_file (which is using the @save_df decorator)

OK, I'll check and will open an issue about it if it seems necessary

tgrandje commented 1 month ago

Nominatim is called from _get_location_openstreetmap which is, in my opinion, compliant with the guidelines

Yes about the User Agent (sorry I missed that, I was looking at the session object and didn't even saw you added it on each get call...) and the cache (let's say I've no excuse for missing that...).

And also yes about the rate. I've just tested your example from the doc and it queries Nominatim at a rate of 1.4it/sec. Might be something I misunderstood on requests retry system (probably) or even the lagtime to get a query and write it to the disk (which I doubt). I'll have to check what I've missed once I get some time. Well, sorry about that also.

😳

tgrandje commented 2 weeks ago

@hadrilec @tfardet do you want to review this once more or should this be merged?

hadrilec commented 2 weeks ago

I will have a look, thanks