Closed tgrandje closed 4 days ago
Thanks for that, should we add ignore_errors=True
to rmtree
?
Ok, so that should be good to go once the last tests are completed. @hadrilec @tfardet I'll let you review the PR
Reading the PR in more details, I'm actually not sure I understand what should be going on here.
For instance, you're clearing the tmpdir generated by a function relying on lru_cache
for the value it returns... a value which relies on the tmpdir!
Basically there are three possibilities here:
atexit
functionlru_cache
from get_capabilities
Yes you're right. In fact, I first went a bit agressively on the cleaning after it froze my WSL partition and had to went back after seeing the lru_cache trick. I might have missed some checks, and I mostly relied on the automated tests to find what went well and what went wrong. That may entirely be the case on the nested lru_cache you've found (the nesting being the reason why emptying the file from the tempdir in the inner function does not trigger an exception).
As far as I can see, the tempdir creation is not consistent over the whole package. I suspect the lru_cached function creating the tempdir was created later on pynsee's creation to avoid continuous hardrive writing. There are still some other non-lru-cached tempdir created though.
I'm not sure many tempdirs should be conserved longer than short-term. To be honest, some tempdirs are not even not necessary as you could perfectly unzip a file without writing to the hardrive, using an io object (with the eventuality of this beeing not right with nested zips...). But that should be something for a future PR...
I think I'm good.
As I've said, I removed what seemed to me unnecessary calls to tempfiles. I've kept it on:
_download_store_file
(which no serve a fixed temp directory to _download_file
and _import_options
). It should be cleaned up by itself ; we could rewrite those functions avoiding tempdir altogether, but the size of those files (> 1 Go of CSV) might be too much for some machines to hold in RAM_get_temp_dir
, which is only used by _create_insee_folder
anymore. I've taken the liberty to add a warning to the user, displaying that if platformdirs fails, pynsee can not be expected to cleanup the tempdir at the end of the python session.I hope there's not too much mistakes on that PR. If there are (which is probably the case), I will be absent for 2 weeks, so feel free to correct it by yourself if needed.
EDIT
I 've also lightly altered the get_temp_dir
to simplify the code. The path to the insee folder should not have changed (if that's the case, then it is a mistake).
I've also had some thoughts for future PRs:
_create_insee_folder
, it might be worth checking if functions can be rewritten to use the @save_df
decorator instead (it could simplify maintainance on the long run)diskcache
and that could be reused at almost no cost (but at the light downfall to have one more depedency) ; update
argument and at first reading I'd say this is not necessary at all, the function being called by download_file (which is using the @save_df
decorator)If that's ok with you both @hadrilec @tfardet , fill free to open issues on this without me.
Nominatim is called from _get_location_openstreetmap which is, in my opinion, compliant with the guidelines
wherever there is a call to
_create_insee_folder
, it might be worth checking if functions can be rewritten to use the@save_df
decorator instead (it could simplify maintainance on the long run)
I plan on restarting the config PR soon, breaking it into smaller bits this time, this is definitely on my todo list and should indeed go away
maybe we should use black once and for all to cleanup the whole code. There is a lot of "garbage" on my PR due to black cleaning up the code on each modified file (sorry about that 😳) ;
good point, we could also add this as a pre-commit hook
I've added a "TODO" comment on download/_download_store_file. This function is not using the
update
argument and at first reading I'd say this is not necessary at all, the function being called by download_file (which is using the@save_df
decorator)
OK, I'll check and will open an issue about it if it seems necessary
Nominatim is called from _get_location_openstreetmap which is, in my opinion, compliant with the guidelines
Yes about the User Agent (sorry I missed that, I was looking at the session object and didn't even saw you added it on each get
call...) and the cache (let's say I've no excuse for missing that...).
And also yes about the rate. I've just tested your example from the doc and it queries Nominatim at a rate of 1.4it/sec. Might be something I misunderstood on requests retry system (probably) or even the lagtime to get a query and write it to the disk (which I doubt). I'll have to check what I've missed once I get some time. Well, sorry about that also.
😳
@hadrilec @tfardet do you want to review this once more or should this be merged?
I will have a look, thanks
Should handle that issue https://github.com/InseeFrLab/pynsee/issues/205