esturdivant-usgs / science-base-automation

Automating large USGS ScienceBase data releases

Make the method for matching data to the XML more flexible #34

Open esturdivant-usgs opened 6 years ago

esturdivant-usgs commented 6 years ago

Right now it matches the XML filename with or without a '_meta' suffix.

It might be better to let the user specify a filename convention to use to match data to XML. e.g. number of characters to match from the beginning of the filename. Alternatively, allow the user to input a regex (regular expression) string for matching.
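A minimal sketch of what a user-supplied regex option might look like (the function name `match_data_to_xml` and its signature are hypothetical, not part of the repo; when no pattern is given it falls back to the current stem-matching behavior):

```python
import re

def match_data_to_xml(xml_file, filenames, pattern=None):
    # Hypothetical sketch: match data files to an XML record using a
    # user-supplied regex. Without a pattern, fall back to the current
    # behavior: strip the extension and an optional '_meta' suffix,
    # then match any filename beginning with that stem.
    if pattern is None:
        stem = xml_file.split('.')[0].split('_meta')[0]
        pattern = re.escape(stem)
    rx = re.compile(pattern)
    return [fn for fn in filenames if rx.match(fn)]
```

With `pattern=r'ubw\.'`, for example, `ubw_test.xml` would no longer be swept in alongside `ubw.tif`.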

esturdivant-usgs commented 6 years ago

from Ellyn:

My but I was inconsistent in my naming rules :> The images follow one "rule" and the GCPs etc another, and the 1st N characters wouldn't work for differentiating the GCPs from transects- they'd need to be the last N chars... I doubt multiple regex would be easy to get to work...

Maybe exact match or some regex as your condition?

esturdivant-usgs commented 6 years ago
import glob
import os

def upload_files_matching_xml(sb, item, xml_file, max_MBsize=2000, replace=True, verbose=False):
    # Upload all files matching the XML filename to the SB page.
    # E.g. xml_file = 'path/data_name.ext.xml' will upload all files beginning with 'data_name'.
    # Optionally remove all files already present on the page.
    if replace:
        # Remove all files (and facets) from child page
        item = remove_all_files(sb, item, verbose)
    # List all files matching the XML filename
    dataname = xml_file.split('.')[0]
    dataname = dataname.split('_meta')[0]
    matches = [fn for fn in glob.iglob(dataname + '*')
               if not fn.endswith('_orig')]
    # Partition by size rather than calling remove() mid-iteration,
    # which skips elements of the list being iterated.
    max_bytes = max_MBsize * 1000000  # convert megabytes to bytes
    bigfiles = [os.path.basename(f) for f in matches if os.path.getsize(f) > max_bytes]
    up_files = [f for f in matches if os.path.getsize(f) <= max_bytes]
    # Upload all files pertaining to the data to the child page
    if verbose:
        print("UPLOADING: files matching '{}'".format(os.path.basename(dataname + '*')))
        if len(bigfiles) > 0:
            print("**TO DO** Files {} are too big to upload here. Please upload them manually afterward.".format(bigfiles))
    item = sb.upload_files_and_upsert_item(item, up_files)  # upsert should "create or update a SB item"
    if verbose:
        print("UPLOAD COMPLETED.")
    return item, bigfiles
esturdivant-usgs commented 6 years ago

Simple solution: change the XML filename to match the data.

However, this would not address the problem of having multiple zip files with slightly different names that all match one XML. To address that, I could add the ability to match on only a certain number of beginning characters. OR...

I could have a separate process for datasets in this category: multiple large zip files to accompany a single XML.

Change this:

    dataname = xml_file.split('.')[0]
    dataname = dataname.split('_meta')[0]

to this:

    dataname = xml_file.split('.')[0]
    dataname = dataname.split('_meta')[0] # will work on _metadata also
    dataname = dataname[:number_of_chars_in_data_prefix]
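The change above could be wrapped into a small helper, sketched here under assumptions: the function name `list_files_for_xml` and the `prefix_len` parameter are hypothetical stand-ins for `number_of_chars_in_data_prefix`:

```python
import glob

def list_files_for_xml(xml_file, prefix_len=None):
    # Hypothetical sketch: derive the data name from the XML filename,
    # optionally truncate it to a user-specified number of leading
    # characters, then glob for all matching files.
    dataname = xml_file.split('.')[0]
    dataname = dataname.split('_meta')[0]  # will work on _metadata also
    if prefix_len is not None:
        dataname = dataname[:prefix_len]
    return [fn for fn in glob.iglob(dataname + '*') if not fn.endswith('_orig')]
```

So with `prefix_len=3`, an XML named `ubw_part1_meta.xml` would also pick up `ubw_part2.zip`.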
esturdivant-usgs commented 6 years ago

Another current challenge: if a filename shares the same stem before either '.' or '_meta' and merely carries an additional suffix, those files will be uploaded as well. E.g. the page created for ubw_meta.xml (or ubw.tif.xml) will also hold ubw_test.xml.
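One conceivable stricter filter (a sketch only; `strict_match` is a hypothetical helper, not repo code) would accept the bare stem, the stem plus '_meta...', or the stem followed immediately by an extension, so a longer name like `ubw_test.xml` no longer matches `ubw`:

```python
import re

def strict_match(stem, filenames):
    # Hypothetical stricter filter: accept only 'stem', 'stem_meta*',
    # or 'stem.<ext...>', so that e.g. 'ubw_test.xml' is not swept
    # onto the page created for 'ubw'.
    rx = re.compile(r'^{}(_meta\w*)?(\..+)?$'.format(re.escape(stem)))
    return [fn for fn in filenames if rx.match(fn)]
```

This keeps the metadata record itself (`ubw_meta.xml`) in the upload set while excluding sibling datasets.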

Another possible work-around is to use a field in the metadata to specify the data filenames.
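That work-around might look like the following sketch, assuming the data filenames were recorded one-per-element in some agreed-upon metadata tag (here 'networkr', the CSDGM Network_Resource_Name, is only an assumption; any tag holding one filename per element would do):

```python
import xml.etree.ElementTree as ET

def filenames_from_metadata(xml_text, tag='networkr'):
    # Sketch of the work-around above: read the data filenames directly
    # from a field in the metadata record. The tag name is an assumption,
    # not an established convention in this repo.
    root = ET.fromstring(xml_text)
    return [el.text.strip() for el in root.iter(tag) if el.text]
```

Matching would then be exact rather than glob-based, sidestepping the prefix/suffix ambiguities entirely.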

esturdivant-usgs commented 5 years ago

commit 14840641bd8f4de690e454d3acdf4222a32b31aa