Lkruitwagen / deepsentinel

DeepSentinel: a sentinel-1 and -2 self-supervised sensor fusion model for general purpose semantic embedding

File formats for data folder? #5

Closed wtrainor closed 3 years ago

wtrainor commented 3 years ago

Hello! I've been exploring your code, and am specifically looking for examples of downloading satellite images from GEE directly to a GCP storage bucket. I'm stuck at the data subdirectory: what kind of files is the code expecting here? It looks like vae.py reads them into a dataframe, but are they actually TIFFs from GEE, or something else?

Thank you!

Lkruitwagen commented 3 years ago

Hi! Thanks for your interest!

To get data from GEE to GCP I use the GEE REST API. I've just realised that I haven't pushed a very crucial downloads script file!

The dataloader (e.g. VAE.py) parses the data records into a dataframe, and the data themselves are stored in .npz files, e.g.: https://storage.googleapis.com/deepsentinel/DEMO_EU_labelled/1000/1000_DL_2020-08-03_14.57753613395721_50.356188161226314_S1arr.npz
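
If it helps, here's a minimal sketch of pulling one of those demo archives and inspecting its contents (the array names inside are whatever was saved, so check arr.files rather than assuming them):

    import io

    import numpy as np
    import requests

    # fetch one of the demo .npz archives from the public bucket and list its arrays
    url = ("https://storage.googleapis.com/deepsentinel/DEMO_EU_labelled/1000/"
           "1000_DL_2020-08-03_14.57753613395721_50.356188161226314_S1arr.npz")
    arr = np.load(io.BytesIO(requests.get(url).content))

    for name in arr.files:
        print(name, arr[name].shape, arr[name].dtype)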

Let me know if that helps!

wtrainor commented 3 years ago

Thank you for the response and pushing the missing script. I see your storage, and am making some sense of how to use the GEE REST API. A few questions if you don't mind:

  1. I have my service key working, but your README says I need to contact GEE customer support for REST API access. Is that still required even though the following runs successfully for me:

    import ee

    # service-account email, plus the path to its JSON private-key file
    service_account = 'myaccountname'
    credentials = ee.ServiceAccountCredentials(service_account, 'my-service-key.json')
    ee.Initialize(credentials)
  2. I'm still unclear on which scripts query the data and put the .npz files in your bucket. Did you use the scripts below, and if so, these are the issues I ran into running them:

    • point_generator.py: do I need to add a 'scihub' key to my DATA_CONFIG.yaml (not just 'scihub_auth')? I assumed this is the file it is looking for.
    • sample_generator.py: how can I generate the .parquet file that sample_generator.py is looking for? I see some .parquet files get written in download_catalog.py but I can't get past self.pts = pd.read_parquet(os.path.join(self.CONFIG['DATA_ROOT'], 'pts', self.version+'.parquet'))

Thank you again.

Lkruitwagen commented 3 years ago

Hey, thanks again! Good questions; they're making me realise I have quite a bit to fix!

To answer your questions:

  1. I think, even if you have the Python client working for GEE, you still need to be explicitly whitelisted by Google to use the REST API. I'm not sure though, perhaps that's changed in the past year or so.
  2. Yes, this is confusing! I'm creating a new dataset now so let me tidy this up a bit. It sounds like I should make a new command line interface. Let me work on this and I'll post an update here.

What are you working on exactly? I think I'm going to make a 100k sample dataset on the public Google bucket. Let me know if this would be of interest.

wtrainor commented 3 years ago

Ah ha! Ok, I just needed to start with download_catalog.py! Now I have some .parquet files to get somewhere...

So, I've previously worked on offshore oil spill and seep detection, but only with optical data. I'm interested in your code:

  1. to learn about the REST API (which I have yet to test on my end, but it seems from the documentation that I have a quota of data I can use for this)
  2. to try out your optical/radar fusion code, possibly on this same seep/spill application, but also on some other applications.

Thanks again for your responses.

Lkruitwagen commented 3 years ago

Hey @wtrainor I've made some adjustments and added a CLI that should help with generating the points. You can use it like: python cli.py generate-points 2019-07-01 100000 100k_unlabelled --n-orbits=31. I'll now make sure that the image sampling and the training also work from the CLI, and I'll be adding this 100k-sample dataset to the Google bucket.

I like your project! Do you know the NGO SkyTruth? I think they're also using SAR for oil spill detection, could be a nice collaboration? Let me know if I can facilitate an intro.

wtrainor commented 3 years ago

Thanks again @Lkruitwagen ... got sidetracked by other things, but I have downloaded the new code.

Super dumb question: to get S2_utm.gpkg, do I convert a KML file from, say, here?

I have heard about SkyTruth for their boat detection work. Depends on how I progress.. :)

Lkruitwagen commented 3 years ago

Never a dumb question! Yes I believe that's what I did. For convenience I've just copied them here: https://drive.google.com/drive/folders/1WrkSb2t_kTtuHq0uqszpo7d-NojLjk37?usp=sharing

Will also make a note in the README. Thanks!

wtrainor commented 3 years ago

Success on the generate-points! :) thank you so much...

Small problem with generate-samples: I tried python cli.py generate-samples gcp gee Sentinel-2 --conf=DATA_CONFIG_wjtg.yaml and got TypeError: generate_samples() got an unexpected keyword argument 'conf'. Then I tried python cli.py generate-samples gcp gee Sentinel-2 and python cli.py generate-samples gcp gee, and both give TypeError: generate_samples() got an unexpected keyword argument 'destinations'.

Is there something I'm doing wrong with the strings? I was guessing a bit on the NAME for the dataset.

Lkruitwagen commented 3 years ago

Ah! I think there's a typo: def generate_samples(name, sources, denstinations). Apols, I found this yesterday, let me push an update.
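
For anyone hitting the same TypeError, here's a minimal illustration (not the actual cli.py code) of why a misspelled parameter surfaces that way when the CLI layer forwards keyword arguments:

    # the function defines the misspelled name...
    def generate_samples(name, sources, denstinations):
        pass

    # ...so forwarding the correctly-spelled keyword, as a CLI wrapper would, fails:
    generate_samples(name='100k_unlabelled', sources='gee', destinations='gcp')
    # TypeError: generate_samples() got an unexpected keyword argument 'destinations'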

Separately, would it be useful if I put development work on a separate branch? The codebase isn't stable yet, and I don't want to make changes to things you're actively using.

wtrainor commented 3 years ago

Do whatever is best for you! I can manage. I check differences before I pull. THANK YOU AGAIN!!!

wtrainor commented 3 years ago

Soooo close @Lkruitwagen! I have all my points and matches, I believe. Now trying generate-samples: python cli.py generate-samples --conf=DATA_CONFIG_wjtg.yaml Misland_v0 gee gcp

I have no tiles nor GEE logs. In sample_generator.py at line 165 I get the error because done_idx is empty. I tried commenting the args generator out and just calling GEE_downloader, which you had commented out on line 173. But because tiles is empty, TLs has no .keys() and therefore throws an error at line 463 in GEE_downloader.

How do I get the tiles or log files? I owe you a beer by the way for all my questions...

Lkruitwagen commented 3 years ago

Yes, sorry, this is quite brittle behaviour. I'd implemented a 'notiles' case but it looks like it's not working now. Have pushed fixes, needed to use a new config param unfortunately :/ so you'll need to remake or add to your config.

Looks like about 1 in 6 tiles aren't sampling the correct utm coordinates of the target image, so they return no data and raise an error. I'm looking into why this is.

The logs will populate by themselves, so don't worry about those. They're just there so that, if you want or need to pause downloading and restart, you don't have to start from the beginning and can split the remaining downloads proportionally across your workers.

wtrainor commented 3 years ago

At what step (script/function) could the tiles be generated? I see that you create TLs = {kk:None for kk in pts.index.values.tolist()} for a 'notiles' case, but in _get_GEE_arr you are getting the UTM coordinates from TLs... Can I generate some TLs with the existing catalogs + code? Or can I scrape the lon/lats from the point files and convert them to UTM somehow?
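
For concreteness, a rough sketch of the lon/lat-to-UTM conversion I have in mind, using pyproj (the parquet path and column names are just placeholders):

    import pandas as pd
    from pyproj import CRS, Transformer

    def lonlat_to_utm(lon, lat):
        # pick the UTM zone from the longitude, northern vs southern hemisphere from the latitude
        zone = int((lon + 180) // 6) + 1
        epsg = (32600 if lat >= 0 else 32700) + zone
        transformer = Transformer.from_crs(CRS.from_epsg(4326), CRS.from_epsg(epsg), always_xy=True)
        easting, northing = transformer.transform(lon, lat)
        return easting, northing, epsg

    # placeholder path/columns: adapt to however the generated points are stored
    pts = pd.read_parquet('pts/my_version.parquet')
    pts[['easting', 'northing', 'epsg']] = pts.apply(
        lambda row: pd.Series(lonlat_to_utm(row['lon'], row['lat'])), axis=1)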

I would really just like to get the GEE query part working. All my previous work relied on using the Python EE API for getting images then gsutil to put on GCP bucket. I think your method would be infinitely better if I can get it to run.

wtrainor commented 3 years ago

Hi @Lkruitwagen, I got it working! I think the main problem was some convoluted reprojections in GEEdownloader.py. I will clean it up and test on a few more locations, then I can commit the changes if you like. THANK YOU again for your help. I'll keep you posted on the results.

Lkruitwagen commented 3 years ago

Hi @wtrainor, that's so great to hear!!! I think I missed your message 6 days ago somehow, I'm so sorry.

Yes please, I'd love it if you could commit the changes! I have a colleague who will also probably start using this repo, also to get images off of GEE, so it would be great if this could work for them as well. Thanks so much!

wtrainor commented 3 years ago

I tried to commit, but I don't think I have permission to commit to this repository. Sound correct?

Also, there is still some buggy behavior with results = P.starmap(GEE_downloader, args)

About every 10th sample using 3 workers, I'll get this error:

    multiprocessing.pool.RemoteTraceback:
    """
    Traceback (most recent call last):
      File "/opt/conda/envs/deepS/lib/python3.8/multiprocessing/pool.py", line 125, in worker
        result = (True, func(*args, **kwds))
      File "/opt/conda/envs/deepS/lib/python3.8/multiprocessing/pool.py", line 51, in starmapstar
        return list(itertools.starmap(args[0], args[1]))
      File "/home/jupyter/deepsentinel/deepsentinel/utils/downloaders.py", line 520, in GEE_downloader
        S1_arr = _get_GEE_arr(
      File "/home/jupyter/deepsentinel/deepsentinel/utils/downloaders.py", line 430, in _get_GEE_arr
        arr = np.load(io.BytesIO(pixels_content))
      File "/opt/conda/envs/deepS/lib/python3.8/site-packages/numpy/lib/npyio.py", line 444, in load
        raise ValueError("Cannot load file containing pickled data "

If I restart python cli.py generate-samples --conf=DATA_CONFIG_wjtg.yaml Misland_v0 gee gcp then it will advance another 10 images...

Lkruitwagen commented 3 years ago

Hi @wtrainor I've invited you to the repo! I think main is still protected, but if you make your own branch we can do a Pull Request. Excited to see the commits!

Hmm, yeah, I'm not sure what we can do about this error. The image data is buffered into a BytesIO object, so it could just be that the data isn't being retrieved properly/fully from GEE. We could add an error catch and retry loop?
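
Something like this, maybe (a sketch only; fetch_pixels here is a hypothetical stand-in for whatever call in downloaders.py returns the raw response bytes):

    import io
    import time

    import numpy as np

    def load_npy_with_retry(fetch_pixels, max_retries=3, wait=5.0):
        """fetch_pixels is a hypothetical zero-argument callable returning the raw
        bytes of the GEE response; retry when the payload can't be parsed as npy."""
        for attempt in range(1, max_retries + 1):
            try:
                return np.load(io.BytesIO(fetch_pixels()))
            except ValueError:
                # e.g. "Cannot load file containing pickled data" when the response
                # body is truncated or is an error message rather than an npy array
                if attempt == max_retries:
                    raise
                time.sleep(wait * attempt)  # simple linear backoff before retrying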

wtrainor commented 3 years ago

Ok, I made a new branch called DS_Pau and committed downloaders.py. I may have already committed cli.py to the main branch (it was just a simple change of df. to dt.).

I'll probably work on the BytesIO issue later.

Lkruitwagen commented 3 years ago

Hi @wtrainor cool, I see the branch! No code on it yet, can you git push up your commits? Thanks so much!

Lkruitwagen commented 3 years ago

Hmm, yes, it looks like it, but are you sure those commits haven't just been made locally? Sorry, maybe you have pushed them; I don't use the Jupyter GUI for git.

Lkruitwagen commented 3 years ago

Ahh, found a 'git push' button. Have you hit this button at the top? [screenshot] Edit: signing off now, will check in the AM!

wtrainor commented 3 years ago

Hi @Lkruitwagen, I think I got it done on the command line. Do you see it now?

Lkruitwagen commented 3 years ago

Merged in #6! Thanks for pushing this code :), closing this issue now.