geoschem / input-data-catalogs

Configuration files for downloading GEOS-Chem input data with the bashdatacatalog utility.
https://github.com/geoschem/bashdatacatalog/wiki

problem with bashdatacatalog-list #2

Open pkasibhatla opened 2 years ago

pkasibhatla commented 2 years ago

I am trying to download the chem input files for GCHP 13.4. The command bashdatacatalog-fetch InputDataCatalogs/13.4/ChemistryInputs.csv from my ExtData directory seems to work fine. Output is attached below.

But when I give the command bashdatacatalog-list -am -r 2019-06-30,2019-08-02 -f xargs-curl InputDataCatalogs/13.4/ChemistryInputs.csv | xargs curl I get a bunch of numbers on the screen. Here are the first few lines of what I see:

(gchp-openmpi-env) psk9@dcc-login-01  /work/psk9/Data/ExtData $ bashdatacatalog-list -am -r 2019-06-30,2019-08-02 -f xargs-curl InputDataCatalogs/13.4/ChemistryInputs.csv | xargs curl
curl: option --write-out: requires parameter
curl: try 'curl --help' for more information
curl: option --write-out: requires parameter
curl: try 'curl --help' for more information
curl: option --url: requires parameter
curl: try 'curl --help' for more information
curl: option --url: requires parameter
curl: try 'curl --help' for more information
curl: option --url: requires parameter
curl: try 'curl --help' for more information
curl: option -o: requires parameter
curl: try 'curl --help' for more information
curl: option --write-out: requires parameter
curl: try 'curl --help' for more information
8.621e-28 8.621e-28 8.621e-28 1.526e-26 1.526e-26 2.224e-25 2.224e-25 2.224e-25 2.700e-24 2.700e-24 1.037e-25 1.037e-25 3.833e-28 3.833e-28 3.833e-28 1.275e-30 1.275e-30 4.538e-33 4.538e-33 9.746e-35 9.746e-35 9.746e-35 1.645e-35 1.645e-35 0.000e+00 0.000e+00 0.000e+00 1.168e-36 1.168e-36 5.806e-35 5.806e-35 1.690e-34 1.690e-34 1.690e-34 3.137e-34 3.137e-34 2.057e-34 2.057e-34 4.358e-35 4.358e-35 4.358e-35 2.697e-37 2.697e-37 0.000e+00 0.000e+00 0.000e+00 
1.507e-28 1.507e-28 1.507e-28 7.446e-27 7.446e-27 4.232e-25 4.232e-25 4.232e-25 2.522e-23 2.522e-23 7.689e-25 7.689e-25 2.125e-27 2.125e-27 2.125e-27 5.511e-30 5.511e-30 1.354e-32 1.354e-32 1.325e-34 1.325e-34 1.325e-34 2.230e-36 2.230e-36 0.000e+00 0.000e+00 0.000e+00 8.795e-36 8.795e-36 1.102e-34 1.102e-34 3.159e-34 3.159e-34 3.159e-34 7.244e-34 7.244e-34 4.216e-34 4.216e-34 1.184e-35 1.184e-35 1.184e-35 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 
5.333e-28 5.333e-28 5.333e-28 5.029e-26 5.029e-26 4.204e-24 4.204e-24 4.204e-24 2.432e-22 2.432e-22 5.353e-24 5.353e-24 9.517e-27 9.517e-27 9.517e-27 1.452e-29 1.452e-29 2.247e-32 2.247e-32 4.036e-35 4.036e-35 4.036e-35 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 3.412e-35 3.412e-35 1.615e-34 1.615e-34 3.813e-34 3.813e-34 3.813e-34 8.920e-34 8.920e-34 4.539e-34 4.539e-34 5.912e-37 5.912e-37 5.912e-37 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 

bashdatacatalog-list seems to work fine for fetching the met data and the HEMCO files.

chem_meta.txt

LiamBindle commented 2 years ago

Hi Prasad,

I think this is an issue with the version of curl that's installed. Could you try wget instead? The commands would be

$ bashdatacatalog-list -am -r 2019-06-30,2019-08-02 -f xargs-curl InputDataCatalogs/13.4/ChemistryInputs.csv > url_download_list.txt
$ wget -i url_download_list.txt -x -nH -nv --cut-dirs=4 

The first command generates a url list, and the second command downloads the list with wget. Could you try this?


I'll take a look at fixing the -f xargs-curl format. It looks like I'm using an option that was added more recently than I thought.

pkasibhatla commented 2 years ago

Hi Liam,

Yes, I realized this may be the case after I sent my email, and I have been trying what you suggest (from ExtData, with --cut-dirs=1). So far so good.

I didn't examine the curl issue carefully, but it seemed to mess up on a lot of files, and sometimes it looked like filenames were being switched during the download step.

Best, Prasad

jiaying002 commented 1 year ago

I also ran into the same "number" issue when downloading the chem input data, but the wget method provided by Liam did not solve my problem.

I followed the commands provided by Liam, and when I ran the wget line, I got the message "No URLs found in download_url_list.txt."

url_download_list.txt

jhaskinsPhD commented 1 year ago

I'm also getting a lot of these curl argument errors described above using the command:

bashdatacatalog-list -am -r "2012-06-01,2013-12-01" -f xargs-curl DataCatalogs/14.1.1/*.csv | xargs -P 4 curl

and when I generate a url list as follows:

bashdatacatalog-list -am -r 2010-09-01,2012-01-01 -f xargs-curl DataCatalogs/14.1.1/*.csv > url_download_list.txt

I get this file: url_download_list.txt

and when I try to use wget as follows: wget -i url_download_list.txt -x -nH -nv

I also get the error: No URLs found in url_download_list.txt.

Has there been any progress on this bug? I'm trying to set up my server at UUtah, so I need to download a lot of different input files... What version of curl is required to avoid these errors? Does anyone know how this curl -o error affects the downloaded files? Does it indeed mess up the names, as Prasad indicated?

yantosca commented 1 year ago

Hi @jhaskinsPhD, thanks for writing. I was able to replicate your error.

Am tagging @SaptSinha who may be more knowledgeable about bashdatacatalog issues than I am.

Also tagging @LiamBindle, who has since left the GEOS-Chem community, but still may have some ideas.

yantosca commented 1 year ago

@jhaskinsPhD: You might also consider using a Globus endpoint for the file transfer. I bet that U of Utah has a Globus account; you can check with your IT support staff there. Download from "GEOS-Chem data (WashU)".

jiaying002 commented 1 year ago

Hey @jhaskinsPhD , I believe I solved it with the help of this link: https://github.com/LiamBindle/bashdatacatalog/wiki/3.-Useful-Commands

I used these commands to solve the problem:

$ bashdatacatalog-list -am -f url catalog.csv > url_download_list.txt
$ wget -i url_download_list.txt -x -nH -nv --cut-dirs=4  # you will need to modify --cut-dirs=N

Compared with the answer given earlier in this issue, the first line uses -f url instead of -f xargs-curl.
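Before handing the list to wget, it may help to check that it really contains one URL per line. The following is only a minimal sketch, assuming the "No URLs found" message comes from a list written in some other format; count_url_lines is a hypothetical helper name, and the demo lines are made up rather than real bashdatacatalog output.

```shell
#!/usr/bin/env bash
# Count the lines in a file that start with a URL scheme.
# A count of 0 suggests the list was written in some other format
# (for example with -f xargs-curl instead of -f url).
count_url_lines() {
    # grep -c prints 0 but exits non-zero when nothing matches,
    # so "|| true" keeps the function usable under "set -e".
    grep -cE '^(https?|ftp)://' "$1" || true
}

# Demo on a throwaway file. Both lines are made up for illustration;
# they are not real bashdatacatalog output.
demo_list="$(mktemp)"
printf '%s\n' \
    'https://example.org/ExtData/CHEM_INPUTS/hypothetical_file.nc' \
    '--some-option value' > "$demo_list"
echo "URL lines found: $(count_url_lines "$demo_list")"
rm -f "$demo_list"
```

Running the same function on your own url_download_list.txt should print a number greater than zero if the list is usable with wget -i.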

You can also use Globus, as @yantosca said, with these commands:

$ bashdatacatalog-list -am -f globus="$(pwd),/remote-data-root/" catalog.csv > globus_batch.txt
$ globus transfer --batch globus_batch.txt SOURCE_ENDPOINT_ID DEST_ENDPOINT_ID

Hope this helps!

yantosca commented 1 year ago

Thanks @jiaying002 for the feedback on this issue!

yctrrr commented 7 months ago

For anyone who may be confused about the wget method, the right way to do this seems to be:

1. bashdatacatalog-list -am -r 2019-06-30,2019-08-02 -f url InputDataCatalogs/13.4/ChemistryInputs.csv > url_downloadlist.txt
   Here the argument -f url writes plain URL links instead of the xargs-curl format.
2. wget -i url_downloadlist.txt -x -nH -nv --cut-dirs=1
   -x creates a hierarchy of directories from the URLs. -nH removes the host-prefixed directory (geoschemdata.wustl.edu in this case). The right value for --cut-dirs=N depends on the location of your download list; it sets how many leading directory components are dropped. For example, --cut-dirs=1 removes ExtData/ from ExtData/CHEM_INPUTS/.

You will also need to repeat the steps above whenever you request new data, so that the list of download URLs is regenerated.
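To pick a value for --cut-dirs without starting a download, the path mapping can be previewed in plain shell. This is a rough sketch of what wget -x -nH --cut-dirs=N is expected to do, not wget's own code; local_path is a hypothetical helper, the file name is made up, and the http scheme is an assumption.

```shell
#!/usr/bin/env bash
# local_path URL N: print the relative path that
# `wget -x -nH --cut-dirs=N` would be expected to use for URL.
# Query strings and other special cases are not handled.
local_path() {
    local url="$1" cut="$2"
    local path="${url#*://}"   # drop the scheme, e.g. "http://"
    path="${path#*/}"          # drop the host name (what -nH does)
    local i
    for ((i = 0; i < cut; i++)); do
        case "$path" in
            */*) path="${path#*/}" ;;  # drop one leading directory
            *) break ;;                # only the file name is left
        esac
    done
    printf '%s\n' "$path"
}

# Hypothetical file under the ExtData/CHEM_INPUTS/ path mentioned above.
local_path "http://geoschemdata.wustl.edu/ExtData/CHEM_INPUTS/hypothetical_file.nc" 1
```

With N=1 the ExtData/ component is dropped, which matches running wget from inside the ExtData directory; with N=0 the files would land under a new ExtData/ subdirectory instead.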