Closed wtrainor closed 3 years ago
Hi! Thanks for your interest!
To get data from GEE to GCP I use the GEE REST API. I've just realised that I hadn't pushed a crucial download script!
The dataloader (e.g. VAE.py) parses the data records into a dataframe, and the data itself are stored in npz files. E.g.: https://storage.googleapis.com/deepsentinel/DEMO_EU_labelled/1000/1000_DL_2020-08-03_14.57753613395721_50.356188161226314_S1arr.npz
Let me know if that helps!
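If it helps, here's a minimal sketch of parsing one of those .npz files. In practice the raw bytes would come from the bucket URL above; here they're faked with an in-memory archive (and a made-up `S1` key/shape) so the snippet runs standalone:

```python
import io

import numpy as np

def load_npz_bytes(raw: bytes) -> dict:
    """Parse the raw bytes of a downloaded .npz file into a dict of arrays."""
    archive = np.load(io.BytesIO(raw))
    return {key: archive[key] for key in archive.files}

# In practice `raw` would come from e.g.
#   urllib.request.urlopen('https://storage.googleapis.com/deepsentinel/...').read()
# Here we fake it so the sketch is self-contained; the key name and shape
# are illustrative, not the bucket's actual schema.
buf = io.BytesIO()
np.savez(buf, S1=np.zeros((256, 256, 2), dtype=np.float32))
arrays = load_npz_bytes(buf.getvalue())
print({k: v.shape for k, v in arrays.items()})  # {'S1': (256, 256, 2)}
```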
Thank you for the response and pushing the missing script. I see your storage, and am making some sense of how to use the GEE REST API. A few questions if you don't mind:
I have my service key working, but your README says I need to contact GEE customer support for REST API access. Is that still necessary even though the following works for me:
import ee
# the second argument is the path to the service account's JSON key file
service_account = 'myaccountname'
credentials = ee.ServiceAccountCredentials(service_account, 'my-service-key.json')
ee.Initialize(credentials)
I'm still unclear on which scripts query the data and put the .npz files in your bucket. Did you use this code, and if so, did you hit issues like the following when running it:
self.pts = pd.read_parquet(os.path.join(self.CONFIG['DATA_ROOT'], 'pts', self.version+'.parquet'))
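(For context, that line resolves to a path like the following; the values here are hypothetical, just to illustrate the layout the loader expects:)

```python
import os

# Hypothetical config values, purely for illustration.
CONFIG = {'DATA_ROOT': '/data/deepsentinel'}
version = '100k_unlabelled'

# The loader looks for the generated points at <DATA_ROOT>/pts/<version>.parquet
pts_path = os.path.join(CONFIG['DATA_ROOT'], 'pts', version + '.parquet')
print(pts_path)  # /data/deepsentinel/pts/100k_unlabelled.parquet
```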
Thank you again.
Hey thanks again! Good questions, they're making me realise I have quite a lot to fix!
To answer your questions:
What are you working on exactly? I think I'm going to make a 100k sample dataset on the public Google bucket. Let me know if this would be of interest.
Ah ha! Ok, I just needed to start with download_catalog.py! Now I have some .parquet files to get somewhere...
So, I've previously worked on offshore oil spill and seep detection, but only with optical data. I'm interested in your code.
Thanks again for your responses.
Hey @wtrainor I've made some adjustments and added a CLI that should help with generating the points. You can use it like:
python cli.py generate-points 2019-07-01 100000 100k_unlabelled --n-orbits=31
I'll now be making sure that the image sampling also works from the CLI, as well as the training, and will be adding this 100k sample dataset to the public Google bucket.
I like your project! Do you know the NGO SkyTruth? I think they're also using SAR for oil spill detection, could be a nice collaboration? Let me know if I can facilitate an intro.
Thanks again @Lkruitwagen ... got sidetracked by other things but have pulled the new code.
Super dumb question: to get S2_utm.gpkg, do I convert a KML file from, say, here?
I have heard about SkyTruth for their boat detection work. Depends on how I progress.. :)
Never a dumb question! Yes I believe that's what I did. For convenience I've just copied them here: https://drive.google.com/drive/folders/1WrkSb2t_kTtuHq0uqszpo7d-NojLjk37?usp=sharing
Will also make a note in the README. Thanks!
Success on the generate-points! :) thank you so much...
Small problems on the generate-samples: I've tried
python cli.py generate-samples gcp gee Sentinel-2 --conf=DATA_CONFIG_wjtg.yaml
and get
TypeError: generate_samples() got an unexpected keyword argument 'conf'
then I tried
python cli.py generate-samples gcp gee Sentinel-2
and
python cli.py generate-samples gcp gee
and both give
TypeError: generate_samples() got an unexpected keyword argument 'destinations'
Is there something I'm doing wrong with the strings? I was guessing a bit on the NAME for the dataset.
Ah! I think there's a typo: def generate_samples(name, sources, denstinations)
. Apols, I found this yesterday, let me push an update.
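For anyone hitting the same thing: the CLI parses `destinations` from the command line, but the function signature spells it `denstinations`, so Python rejects the keyword. A stripped-down reproduction of the mechanics (no click involved, just keyword passing):

```python
# Typo'd signature, as in the repo at the time: note 'denstinations'.
def generate_samples(name, sources, denstinations):
    return name, sources, denstinations

# The CLI layer parses arguments into a dict keyed by the *declared* option
# names, then calls the function with them as keyword arguments.
parsed_cli_args = {'name': 'Misland_v0', 'sources': 'gee', 'destinations': 'gcp'}

try:
    generate_samples(**parsed_cli_args)
except TypeError as e:
    print(e)  # generate_samples() got an unexpected keyword argument 'destinations'
```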
Separately, would it be useful if I put development work on a separate branch? The codebase isn't stable yet, and I don't want to make changes to things you're actively using.
Do whatever is best for you! I can manage. I check differences before I pull. THANK YOU AGAIN!!!
Soooo close @Lkruitwagen! I have all my points and matches I believe. Now trying to generate-samples
python cli.py generate-samples --conf=DATA_CONFIG_wjtg.yaml Misland_v0 gee gcp
I have no tiles and no gee logs. In sample_generator.py I get an error at line 165 because done_idx is empty. I tried commenting out the args generator and just calling GEE_downloader directly, which you had commented out at line 173. But because tiles is empty, TLs has no keys, so it throws an error at line 463 in GEE_downloader.
How do I get the tiles or log files? I owe you a beer by the way for all my questions...
Yes, sorry, this is quite brittle behaviour. I'd implemented a 'notiles' case but it looks like it's not working now. Have pushed fixes, needed to use a new config param unfortunately :/ so you'll need to remake or add to your config.
Looks like about 1 in 6 tiles aren't sampling the correct utm coordinates of the target image, so they return no data and raise an error. I'm looking into why this is.
Logs will populate by themselves, so don't worry about that. They're just there in case you want/need to pause downloading and restart: then you don't need to start from the beginning, and can split the remaining downloads across your workers proportionally.
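The resume logic can be sketched roughly like this (the function and names here are an illustration, not the repo's actual log schema):

```python
def split_remaining(all_idx, done_idx, n_workers):
    """Drop already-downloaded indices, then share the rest across workers."""
    remaining = [i for i in all_idx if i not in set(done_idx)]
    # Round-robin split so each worker gets a proportional share.
    return [remaining[w::n_workers] for w in range(n_workers)]

# Suppose points 0-2 were already logged as done before we paused:
chunks = split_remaining(list(range(10)), done_idx=[0, 1, 2], n_workers=3)
print(chunks)  # [[3, 6, 9], [4, 7], [5, 8]]
```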
At what step (script/function) could the tiles be generated? I see that you create TLs = {kk:None for kk in pts.index.values.tolist()} for a 'notiles' case, but in _get_GEE_arr you're getting the UTM coordinates from TLs ... Can I generate some TLs with the existing catalogs + code? Or can I scrape the lon/lats from the point files and convert them to UTM somehow?
I would really just like to get the GEE query part working. All my previous work relied on using the Python EE API for getting images then gsutil to put on GCP bucket. I think your method would be infinitely better if I can get it to run.
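(For my own reference, picking the UTM zone from a lon/lat is simple enough in pure Python; the actual reprojection could then be done with e.g. pyproj's Transformer on the resulting EPSG code. A sketch, not the repo's method:)

```python
def utm_epsg(lon: float, lat: float) -> int:
    """EPSG code of the WGS84 UTM zone containing (lon, lat)."""
    zone = int((lon + 180) // 6) + 1
    # Northern-hemisphere zones are EPSG:326xx, southern are EPSG:327xx.
    return (32600 if lat >= 0 else 32700) + zone

# e.g. the DEMO point from earlier in the thread (lon ~14.58, lat ~50.36):
print(utm_epsg(14.57753613395721, 50.356188161226314))  # 32633
```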
Hi @Lkruitwagen, I got it working! I think the main problem was some convoluted reprojections in GEEdownloader.py. I'll clean it up and test on a few more locations, then I can commit the changes if you like. THANK YOU again for your help. I'll keep you posted on the results.
Hi @wtrainor, that's so great to hear!!! I think I missed your message 6 days ago somehow, I'm so sorry.
Yes please, I'd love it if you could commit the changes! I have a colleague who will also probably start using this repo, also to get images off of GEE, so it would be great if this could work for them as well. Thanks so much!
I tried to commit, but I don't think I have permission to commit to this repository. Does that sound correct?
Also, there is still some buggy behavior with results = P.starmap(GEE_downloader, args)
About every 10th sample using 3 workers, I'll get this error:
multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
  File "/opt/conda/envs/deepS/lib/python3.8/multiprocessing/pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
  File "/opt/conda/envs/deepS/lib/python3.8/multiprocessing/pool.py", line 51, in starmapstar
    return list(itertools.starmap(args[0], args[1]))
  File "/home/jupyter/deepsentinel/deepsentinel/utils/downloaders.py", line 520, in GEE_downloader
    S1_arr = _get_GEE_arr(
  File "/home/jupyter/deepsentinel/deepsentinel/utils/downloaders.py", line 430, in _get_GEE_arr
    arr = np.load(io.BytesIO(pixels_content))
  File "/opt/conda/envs/deepS/lib/python3.8/site-packages/numpy/lib/npyio.py", line 444, in load
    raise ValueError("Cannot load file containing pickled data "
If I restart python cli.py generate-samples --conf=DATA_CONFIG_wjtg.yaml Misland_v0 gee gcp then it will advance another 10 images...
Hi @wtrainor I've invited you to the repo! I think main is still protected, but if you make your own branch we can do a Pull Request. Excited to see the commits!
Hmm yeah I'm not sure what we can do about this error. The image data is buffered into a BytesIO object so it could just be that the data isn't being retrieved properly/full from GEE. We could add an error catch and retry loop?
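Something like this, say (a generic sketch, not tested against the repo; the commented usage line and `fetch_pixels` are hypothetical):

```python
import time

def with_retries(fn, n_tries=3, delay=1.0):
    """Call fn(); on failure, wait and retry up to n_tries times total."""
    for attempt in range(n_tries):
        try:
            return fn()
        except (ValueError, OSError):  # np.load raises ValueError on bad bytes
            if attempt == n_tries - 1:
                raise  # out of retries, re-raise the last error
            time.sleep(delay)

# Hypothetical usage inside _get_GEE_arr:
# arr = with_retries(lambda: np.load(io.BytesIO(fetch_pixels(url))))
```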
Ok, I made a new branch called DS_Pau, and committed downloaders.py. I may have already committed cli.py to the main branch (it was just a simple change of df. to dt.).
I'll probably work on the BytesIO issue later.
Hi @wtrainor cool, I see the branch! No code on it yet, can you git push up your commits? Thanks so much!
Hmmm, yes it looks like it, but are you sure those commits haven't just been made locally? Sorry, maybe you have pushed them, I don't use the Jupyter GUI for git.
ahh found a 'git push' button. Have you hit this button at the top?
ed: signing off now, will check in the AM!
Hi @Lkruitwagen, I think I got it done on the command line. Do you see it now?
Merged in #6! Thanks for pushing this code :), closing this issue now.
Hello! I've been exploring your code, and am specifically looking for examples of downloading satellite images from GEE directly to a GCP storage bucket. I'm stuck at the data subdirectory: what kind of files is the code expecting there? It looks like vae.py reads them into a dataframe, but are they actually TIFFs from GEE, or something else?
Thank you!