GEUS-Glaciology-and-Climate / pypromice

Process AWS data from L0 (raw logger) through Lx (end user)
https://pypromice.readthedocs.io
GNU General Public License v2.0

getL0tx, only sort updated aws-l0/tx files #88

Closed: patrickjwright closed this issue 9 months ago

patrickjwright commented 1 year ago

Relevant code section starts with:

    if out_dir is not None:
        # Sort L0tx files and add tails    
        for f in glob(out_dir+'/*.txt'):

As currently implemented, we sort and write all .txt files back to disk. The next section in l3_processor.sh then sees that every file has been updated and sends them all through getL3, despite the logic in place that identifies only modified files.

If a station has not received a new transmission, this results in redundant and unnecessary processing time.

Likewise, joinL3 could be modified to only run for modified raw and/or transmitted files (potentially a separate issue could be opened for this).

PennyHow commented 1 year ago

Proposed solution for getL0tx:

    if out_dir is not None:
        # Sort L0tx files and add tails
        for f in glob(out_dir+'/*.txt'):

            # Sort lines in recently modified L0tx files only
            if os.path.getmtime(f) > t:
                sortLines(f, out_pn)
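As a self-contained sketch of this mtime-gated approach: the hypothetical `sort_recent_files` below stands in for the snippet above, with `threshold` playing the role of `t` and a plain in-place line sort as a placeholder for pypromice's `sortLines`.

```python
import os
from glob import glob

def sort_recent_files(out_dir, threshold):
    """Sort lines in-place for .txt files modified after `threshold`
    (epoch seconds). Untouched files keep their old mtime, so downstream
    steps that key on modification time are not re-triggered."""
    processed = []
    for f in glob(os.path.join(out_dir, '*.txt')):
        # Skip files whose modification time predates the threshold
        if os.path.getmtime(f) <= threshold:
            continue
        with open(f) as fh:
            lines = fh.readlines()
        with open(f, 'w') as fh:
            fh.writelines(sorted(lines))
        processed.append(f)
    return processed
```

The key design point is that only files passing the mtime check are rewritten; files left alone retain their original timestamps.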
PennyHow commented 1 year ago

For joinL3, I would suggest implementing this in our l3_processor.sh script instead.

    echo "Running AWS L3 RAW and TX joiner"
    cd /mnt/data/aws/pypromice_aws/aws-l3

    echo "Finding all unique AWS data by name..."
    names=$(find . -maxdepth 3 -type f -name "*.csv" | cut -d"/" -f3 | sort | uniq)
    echo $names

    echo "Joining L3 data..."
    parallel --will-cite --bar --xapply "joinL3 -s ./raw/{}/{}_10min.csv -t ./tx/{}/{}_hour.csv -o level_3 -v ../pypromice/src/pypromice/variables.csv -m ../pypromice/src/pypromice/metadata.csv -d raw" ::: $names

Right now, we simply find all unique station names and then join them regardless of modification time. I think we could solve this by specifying a maximum file age in the name-fetching step, something along these lines:

    names=$(find . -maxdepth 3 -type f -name "*.csv" -mmin -60 | cut -d"/" -f3 | sort | uniq)

I think 60 minutes would be enough, right? It could be longer if needed.
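For reference, the same "recently modified stations" filter can be sketched in Python. The function name `recent_station_names` is hypothetical, and it assumes the `<level>/<station>/<file>.csv` layout implied by the `cut -d"/" -f3` step in the script above.

```python
import os
import time

def recent_station_names(root, max_age_min=60):
    """Collect unique station names for .csv files modified within the
    last `max_age_min` minutes, roughly mirroring:
        find . -maxdepth 3 -type f -name "*.csv" -mmin -60 \
            | cut -d"/" -f3 | sort | uniq
    Assumes files live at <root>/<level>/<station>/<file>.csv."""
    cutoff = time.time() - max_age_min * 60
    names = set()
    for dirpath, _dirnames, filenames in os.walk(root):
        # Approximate find's -maxdepth 3: only look at files whose
        # directory is at most two levels below root
        depth = os.path.relpath(dirpath, root).count(os.sep)
        if depth > 1:
            continue
        for fn in filenames:
            path = os.path.join(dirpath, fn)
            if fn.endswith('.csv') and os.path.getmtime(path) >= cutoff:
                # Station name is the directory containing the file
                names.add(os.path.basename(dirpath))
    return sorted(names)
```

With `max_age_min=60` this reproduces the effect of `-mmin -60`: a station whose raw and tx files are all older than the cutoff simply drops out of the join list.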

PennyHow commented 9 months ago

We now run joinL3 only on files modified within the last 2 hours, as follows:

    names=$(find . -maxdepth 3 -type f -name "*.csv" -mmin -120 | cut -d"/" -f3 | sort | uniq)