GEUS-Glaciology-and-Climate / pypromice

Process AWS data from L0 (raw logger) through Lx (end user)
https://pypromice.readthedocs.io
GNU General Public License v2.0

getL0tx, only sort updated aws-l0/tx files #88

Closed: patrickjwright closed this issue 9 months ago

patrickjwright commented 1 year ago

Relevant code section starts with:

    if out_dir is not None:
        # Sort L0tx files and add tails    
        for f in glob(out_dir+'/*.txt'):

As currently implemented, we sort and write all .txt files back to disk. The next section in l3_processor.sh then sees that every file has been updated and sends them all through getL3, despite the logic in place that identifies only modified files.

If a station has not received a new transmission, this results in redundant and unnecessary processing time.

Likewise, joinL3 could be modified to only run for modified raw and/or transmitted files (potentially a separate issue could be opened for this).

PennyHow commented 1 year ago

Proposed solution for getL0tx:

    if out_dir is not None:
        # Sort L0tx files and add tails
        for f in glob(out_dir+'/*.txt'):

            # Sort lines in recently modified L0tx files only
            if os.path.getmtime(f) > t:
                sortLines(f, out_pn)
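As a self-contained sketch of this mtime-gated approach: the hypothetical `sort_recent_files` below stands in for the snippet above, with `threshold` playing the role of `t` and a plain in-place line sort as a placeholder for pypromice's `sortLines`.

```python
import os
from glob import glob

def sort_recent_files(out_dir, threshold):
    """Sort lines in-place for .txt files modified after `threshold`
    (epoch seconds). Untouched files keep their old mtime, so downstream
    steps that key on modification time are not re-triggered."""
    processed = []
    for f in glob(os.path.join(out_dir, '*.txt')):
        # Skip files whose modification time predates the threshold
        if os.path.getmtime(f) <= threshold:
            continue
        with open(f) as fh:
            lines = fh.readlines()
        with open(f, 'w') as fh:
            fh.writelines(sorted(lines))
        processed.append(f)
    return processed
```

The key design point is that only files passing the mtime check are rewritten; files left alone retain their original timestamps.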
PennyHow commented 1 year ago

For joinL3, I would suggest implementing this in our l3_processor.sh script instead.

    echo "Running AWS L3 RAW and TX joiner"
    cd /mnt/data/aws/pypromice_aws/aws-l3

    echo "Finding all unique AWS data by name..."
    names=$(find . -maxdepth 3 -type f -name "*.csv" | cut -d"/" -f3 | sort | uniq)
    echo $names

    echo "Joining L3 data..."
    parallel --will-cite --bar --xapply "joinL3 -s ./raw/{}/{}_10min.csv -t ./tx/{}/{}_hour.csv -o level_3 -v ../pypromice/src/pypromice/variables.csv -m ../pypromice/src/pypromice/metadata.csv -d raw" ::: $names

Right now, we simply find all unique station names and then join them regardless of modification time. I think we could solve this by specifying a maximum file age in the name-fetching step, something along these lines:

    names=$(find . -maxdepth 3 -type f -name "*.csv" -mmin -60 | cut -d"/" -f3 | sort | uniq)

I think 60 minutes would be enough, right? It could be longer if needed.
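For reference, the same "recently modified stations" filter can be sketched in Python. The function name `recent_station_names` is hypothetical, and it assumes the `<level>/<station>/<file>.csv` layout implied by the `cut -d"/" -f3` step in the script above.

```python
import os
import time

def recent_station_names(root, max_age_min=60):
    """Collect unique station names for .csv files modified within the
    last `max_age_min` minutes, roughly mirroring:
        find . -maxdepth 3 -type f -name "*.csv" -mmin -60 \
            | cut -d"/" -f3 | sort | uniq
    Assumes files live at <root>/<level>/<station>/<file>.csv."""
    cutoff = time.time() - max_age_min * 60
    names = set()
    for dirpath, _dirnames, filenames in os.walk(root):
        # Approximate find's -maxdepth 3: only look at files whose
        # directory is at most two levels below root
        depth = os.path.relpath(dirpath, root).count(os.sep)
        if depth > 1:
            continue
        for fn in filenames:
            path = os.path.join(dirpath, fn)
            if fn.endswith('.csv') and os.path.getmtime(path) >= cutoff:
                # Station name is the directory containing the file
                names.add(os.path.basename(dirpath))
    return sorted(names)
```

With `max_age_min=60` this reproduces the effect of `-mmin -60`: a station whose raw and tx files are all older than the cutoff simply drops out of the join list.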

PennyHow commented 9 months ago

We now run joinL3 only on files modified within the last 2 hours, as follows:

    names=$(find . -maxdepth 3 -type f -name "*.csv" -mmin -120 | cut -d"/" -f3 | sort | uniq)