Bugfix/geoparquet storage

tgrandje commented 1 month ago

Should handle the geoparquet format for geodataframe
Replace multiprocessing.Pool by pebble.ThreadPool

tgrandje commented 1 month ago

2 tests failed on my machine:

FAILED tests/download/test_pynsee_download.py::MyTests::test_download_file_all - AssertionError: False is not true
FAILED tests/geodata/test_pynsee_geodata.py::TestFunction::test_get_geodata_short - ValueError: !!! geometry column is missing !!!

Second one might be linked to a proxy configuration (there is a requests Session built on the fly among the test): let's see what is happening on github.

First failure shouldn't be linked to this PR (I'll wait for the results on github before further inspection if I can)

tgrandje commented 1 month ago

Ok, I think I've fixed the failing tests on my machine.

I'll also try a patch to requests using requests-cache: I'm not sure if github is reusing the same machine everytime it starts a bunch of tests on each python version. If that's the case, we should have fastest results for the whole benchmark (as requests will find previous results from disk instead of performing http(s) requests). We'll see what the results are. Note: I have to rewrite some of the code, as the session object should be created outside multithreading/multiprocessing to avoid crashing the kernel.

By the way, I seem to recall a discussion validating the testing of only lower and upper python version. Wasn't it ever implemented? Should I do that at the same time?

tgrandje commented 1 month ago

Am I right to assume the tests are failing because they are checked against the workflows yml from the master branch? (The latest failed test has no step named "Add requests-cache", though you find it in the linked yaml file from the result's page...).

Any idea on how to test the impact of requests-cache on multiple tests? Maybe a temporary adding to the requirements-extra.txt file if that's ok with you?

(On my machine, without clearing the cache : first run takes ~30min. Next one: ~5min. The sqlite cache is 1.3 Go though, so not cached as an artifact right now.)

tfardet commented 1 month ago

By the way, I seem to recall a discussion validating the testing of only lower and upper python version. Wasn't it ever implemented? Should I do that at the same time?

Yes, please! (3.8 and 3.12) But this and the caching should be in a separate PR so we're sure that no bugs can be overlooked.

Am I right to assume the tests are failing because they are checked against the workflows yml from the master branch?

Yes, I don't think there is any way to avoid this since otherwise any random malicious PR could leak secrets.

Regarding the geoparquet issue, do I understand correctly that the only reason this fails is because of the GeoFrDataFrame?

tgrandje commented 1 month ago

But this and the caching should be in a separate PR so we're sure that no bugs can be overlooked.

Yes, I got carried away... (Started this as resolving a pure geodataframe/arrow issue, then bunked into the mutiprocessing issue, then to 6 hours tests...). If that's ok with you, I'll let the multithreading fix into this one, as that prevented me to carry the tests on my laptop. I'd rather also let the rewriting on session outside of multithreading scope : this is not necessary for classical purposes, but patching requests on the test resulting on crashing the kernel during multithreading. So I'd say that's a 50/50 share between the 2 PR...

I'll revert the caching and python test version into another PR.

tgrandje commented 1 month ago

Regarding the geoparquet issue, do I understand correctly that the only reason this fails is because of the GeoFrDataFrame?

No, there was a unresolved bug still on my first commits. Once I remove the caching and workflow alterations, this should work fine.

tfardet commented 1 month ago

No, there was a unresolved bug still on my first commits. Once I remove the caching and workflow alterations, this should work fine.

I meant the original issue of not being able to write the parquet file

tgrandje commented 1 month ago

Oh, sorry :-)

So yes, I suppose that's one way of thinking: if the GeoFrDataFrame inherits from pd.DataFrame and not gpd.GeoDataFrame, then the .write_parquet method is not the good one. Like I said in the issue, I'm in favor of adding geopandas as a dependency, but I can understand @hadrilec 's reserves. The compromise to move geopandas into the optional dependencies seems a good one, and that's what i dit in the latests commits.

tfardet commented 1 month ago

I'm in favor of adding geopandas as a dependency, but I can understand @hadrilec 's reserves. The compromise to move geopandas into the optional dependencies seems a good one, and that's what i dit in the latests commits.

I'm not a fan of that compromise as it adds complexity to a code that, as we've seen, is already quite elaborate. I did not see anyone complaining about geopandas installation over the past year, except one person stuck with python 3.7 on a windows where they could not update anything, and this was just related to not being able to use the version they wanted (we need 0.8+ so this is a non-issue for us).

tgrandje commented 1 month ago

I agree with you, but I'm also in favor of a quick release of the bugfix (it is heavily slowing some algorithms of mine by downloading 100Mo dataset over and over...).

InseeFrLab / pynsee

Bugfix/geoparquet storage #204