Use of bashdatacatalog to find invalid files

cbutenhoff commented 4 months ago

Your name

Chris Butenhoff

Your affiliation

Portland State University

Please provide a clear and concise description of your question or discussion topic.

I recently used bashdatacatalog to download input files for GCHP v14.3.1 for a multi-year simulation. The download took a couple of days, and when it finished I noticed that some (many?) of the file names were corrupted.

For example, HEMCO/OFFLINE_BIOVOC/v2021-12/0.5x0.625/2006/01/biovoc_05.20060103.nc is actually biovoc_05.20060102.nc according to the netCDF header info, MERRA2.20070101.I3.05x0625.nc4 is actually MERRA2.20070101.A3mstE.05x0625.nc4, and so on.

I believe this happened because I used the parallel option in xargs -P curl to download the files, and some communication/timing error occurred.

I would like to avoid downloading all the files again. I notice that bashdatacatalog-list has the -w option to identify files with incorrect checksums. I tried it on files that I know are invalid, but bashdatacatalog-list did not flag them.

Here is my usage to find the corrupt biovoc files:

> bashdatacatalog-list -aw -p "OFFLINE_BIOVOC/v2021-12/0.5x0.625/2006/01" InputDataCatalogs/**/*.csv

I ran it in my ExtData directory, as I did when I downloaded the files. I also tried the pattern "biovoc", but that didn't return any file names either.

I don't know much about how checksums work. In the case where the file is intact but has the wrong filename, would the checksum still match?

Thanks for any help you can provide.

cbutenhoff commented 4 months ago

As a follow-up, I was able to write a Python script that renamed the MERRA2 files based on the real file name listed under global attributes in the netCDF metadata. Unfortunately, the metadata in the HEMCO netCDF files does not provide the file name in a consistent format so renaming these files will be more difficult.

yidant commented 4 months ago

Thanks for pointing this out @cbutenhoff. I haven't encountered this issue with xargs -P curl before. Could you let us know how many streams you used to download the data?

We use MD5 checksums, which only verify the content of the file, not the file name.
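Since an MD5 digest depends only on a file's bytes, an intact file under the wrong name can in principle be matched back to whichever catalog entry has the same checksum. Here is a minimal Python sketch of that idea. It assumes you have already built `expected_by_md5`, a hypothetical dict mapping each MD5 hex digest to the relative path it should have. How to build that table from the catalog is not shown here, and `find_misnamed` is an illustrative helper, not part of bashdatacatalog. Files with identical content under different names would collide in such a table, so treat the result as a list of candidates to check by hand.

```python
import hashlib
from pathlib import Path


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_misnamed(root, expected_by_md5):
    """Map each misnamed file under `root` to the relative path its content matches.

    `expected_by_md5` is an assumed dict {md5 hex digest: relative path};
    files whose digest is not in the table are skipped, not reported.
    """
    root = Path(root)
    misnamed = {}
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        expected = expected_by_md5.get(md5_of_file(path))
        actual = path.relative_to(root).as_posix()
        if expected is not None and expected != actual:
            misnamed[actual] = expected
    return misnamed
```

Calling `find_misnamed("ExtData", expected_by_md5)` would return pairs of current name and expected name, which could then be reviewed before renaming anything.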

Unfortunately, the metadata format is different across collections as they are from different sources. Perhaps you can try extracting the key information with regular expressions.

cbutenhoff commented 4 months ago

Thanks @yidant. At different times I used 4 and 8 streams. I'm not positive the parallel download caused the problem, but files I downloaded using 'wget' seem fine.

In some (most?) of the HEMCO nc files, there is a 'history' attribute that contains the actual file name, though it's not consistently located. I'm trying to do some checks based on this.
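A check based on the 'history' attribute could start with a regular expression that pulls every .nc or .nc4 file name out of the attribute text. The sketch below is only that text-processing step. It assumes the attribute has already been read into a string (for example with a netCDF library), and the pattern is a guess based on names like biovoc_05.20060102.nc, so it would likely need adjusting per collection.

```python
import re

# Assumed pattern: a run of word characters, dots, or hyphens ending in .nc or .nc4.
# Slashes are excluded, so directory prefixes are dropped automatically.
FILENAME_RE = re.compile(r"[\w.\-]+\.nc4?\b")


def candidate_names(history):
    """Return the distinct .nc/.nc4 file names mentioned in a history string, in order."""
    seen = []
    for name in FILENAME_RE.findall(history):
        if name not in seen:
            seen.append(name)
    return seen
```

If `candidate_names` returns exactly one name and it differs from the name on disk, that file is a likely rename candidate. Zero or several names means the file needs a manual look.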

In the end I'll probably spend more time trying to rename the corrupt filenames than I would have spent just redownloading all the input data :).

cbutenhoff commented 4 months ago

This comment may be better placed in its own issue, but I believe the data catalog for GCHP v14.3.0 incorrectly includes 2022 data in the GFED4/v2023-03/2023 folder:

./HEMCO/GFED4/v2023-03/2023/GFED4_3hrfrac_gen.025x025.202201.nc
./HEMCO/GFED4/v2023-03/2023/GFED4_3hrfrac_gen.025x025.202202.nc
./HEMCO/GFED4/v2023-03/2023/GFED4_3hrfrac_gen.025x025.202203.nc
./HEMCO/GFED4/v2023-03/2023/GFED4_3hrfrac_gen.025x025.202204.nc
./HEMCO/GFED4/v2023-03/2023/GFED4_3hrfrac_gen.025x025.202205.nc
./HEMCO/GFED4/v2023-03/2023/GFED4_3hrfrac_gen.025x025.202206.nc
./HEMCO/GFED4/v2023-03/2023/GFED4_3hrfrac_gen.025x025.202207.nc
./HEMCO/GFED4/v2023-03/2023/GFED4_3hrfrac_gen.025x025.202208.nc
./HEMCO/GFED4/v2023-03/2023/GFED4_3hrfrac_gen.025x025.202209.nc
./HEMCO/GFED4/v2023-03/2023/GFED4_3hrfrac_gen.025x025.202210.nc
./HEMCO/GFED4/v2023-03/2023/GFED4_3hrfrac_gen.025x025.202211.nc
./HEMCO/GFED4/v2023-03/2023/GFED4_3hrfrac_gen.025x025.202212.nc
./HEMCO/GFED4/v2023-03/2023/GFED4_dailyfrac_gen.025x025.202201.nc
./HEMCO/GFED4/v2023-03/2023/GFED4_dailyfrac_gen.025x025.202202.nc
./HEMCO/GFED4/v2023-03/2023/GFED4_dailyfrac_gen.025x025.202203.nc
./HEMCO/GFED4/v2023-03/2023/GFED4_dailyfrac_gen.025x025.202204.nc
./HEMCO/GFED4/v2023-03/2023/GFED4_dailyfrac_gen.025x025.202205.nc
./HEMCO/GFED4/v2023-03/2023/GFED4_dailyfrac_gen.025x025.202206.nc
./HEMCO/GFED4/v2023-03/2023/GFED4_dailyfrac_gen.025x025.202207.nc
./HEMCO/GFED4/v2023-03/2023/GFED4_dailyfrac_gen.025x025.202208.nc
./HEMCO/GFED4/v2023-03/2023/GFED4_dailyfrac_gen.025x025.202209.nc
./HEMCO/GFED4/v2023-03/2023/GFED4_dailyfrac_gen.025x025.202210.nc
./HEMCO/GFED4/v2023-03/2023/GFED4_dailyfrac_gen.025x025.202211.nc
./HEMCO/GFED4/v2023-03/2023/GFED4_dailyfrac_gen.025x025.202212.nc
./HEMCO/GFED4/v2023-03/2023/GFED4_gen.025x025.202201.nc
./HEMCO/GFED4/v2023-03/2023/GFED4_gen.025x025.202202.nc
./HEMCO/GFED4/v2023-03/2023/GFED4_gen.025x025.202203.nc
./HEMCO/GFED4/v2023-03/2023/GFED4_gen.025x025.202204.nc
./HEMCO/GFED4/v2023-03/2023/GFED4_gen.025x025.202205.nc
./HEMCO/GFED4/v2023-03/2023/GFED4_gen.025x025.202206.nc
./HEMCO/GFED4/v2023-03/2023/GFED4_gen.025x025.202207.nc
./HEMCO/GFED4/v2023-03/2023/GFED4_gen.025x025.202208.nc
./HEMCO/GFED4/v2023-03/2023/GFED4_gen.025x025.202209.nc
./HEMCO/GFED4/v2023-03/2023/GFED4_gen.025x025.202210.nc
./HEMCO/GFED4/v2023-03/2023/GFED4_gen.025x025.202211.nc
./HEMCO/GFED4/v2023-03/2023/GFED4_gen.025x025.202212.nc
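A listing like the one above can be produced mechanically by comparing the year in each file name with the name of its parent directory. The following Python sketch assumes file names end in `.YYYYMM.nc` and sit directly inside a directory named for the year, as in the GFED4 paths shown. `year_mismatches` is an illustrative helper, not an existing tool.

```python
import re
from pathlib import PurePosixPath

# Assumed naming convention: "<anything>.YYYYMM.nc"
STAMP_RE = re.compile(r"\.(\d{4})(\d{2})\.nc$")


def year_mismatches(paths):
    """Return the paths whose file-name year differs from their parent directory name."""
    flagged = []
    for raw in paths:
        path = PurePosixPath(raw)
        match = STAMP_RE.search(path.name)
        # Only compare when the parent directory name is purely numeric (a year).
        if match and path.parent.name.isdigit() and match.group(1) != path.parent.name:
            flagged.append(raw)
    return flagged
```

Feeding it the paths from a catalog or from `find` output would flag every entry in the 2023 folder that carries a 2022 date stamp.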

yidant commented 2 months ago

Hi @cbutenhoff!

I think your first issue is similar to this issue (https://github.com/geoschem/GCHP/issues/438#issuecomment-2356742027). After looking into it, we found that it results from the xargs curl command failing to process the multi-line download list generated by bashdatacatalog. In addition to -P for parallel streams, you can use -L 1 to specify that only one line of input is passed to curl at a time, for example: xargs -L 1 -P 4 curl.
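For readers who prefer Python to xargs, here is a rough sketch of the same idea: each input line becomes exactly one curl invocation, and a fixed number of invocations run at once. It assumes every line of the download list holds the complete arguments for a single download. The real list format produced by bashdatacatalog is not reproduced here. `build_commands` and `run_all` are hypothetical helper names, and `shlex` quoting rules are close to but not identical to those of xargs.

```python
import shlex
import subprocess
from concurrent.futures import ThreadPoolExecutor


def build_commands(lines, program="curl"):
    """Turn each non-blank line into its own argument list, roughly like `xargs -L 1`."""
    return [[program, *shlex.split(line)] for line in lines if line.strip()]


def run_all(commands, streams=4):
    """Run the commands with at most `streams` at a time, roughly like `xargs -P`.

    Returns the exit codes in the same order as `commands`.
    """
    with ThreadPoolExecutor(max_workers=streams) as pool:
        return list(pool.map(lambda cmd: subprocess.run(cmd).returncode, commands))
```

Because every command carries the arguments of one line only, an output name cannot end up paired with a URL from a different line, and a nonzero exit code points at the specific download that failed.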

Thanks for reporting the second issue! A new checksum file has been generated, so it should be fixed.

github-actions[bot] commented 6 days ago

This issue has been automatically marked as stale because it has not had recent activity. If there are no updates within 7 days it will be closed. You can add the "never stale" tag to prevent the issue from being closed.
