Closed patrickjwright closed 9 months ago
Proposed solution for getL0tx:
Call t = time.time() at the beginning of getL0tx to record the time at which the script starts. Then, after the tx fetching is complete, sort only the files that have been modified since then (i.e. files whose modification time is later than t). So something like this:
if out_dir is not None:
    # Sort L0tx files and add tails
    for f in glob(out_dir+'/*.txt'):
        # Sort lines in recent L0tx files
        if os.path.getmtime(f) > t:
            sortLines(f, out_pn)
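For reference, here is a minimal, self-contained sketch of the same modification-time filter, separated from the sorting step. The helper name recently_modified and its pattern parameter are hypothetical and not part of pypromice; it only shows how the comparison against t selects files.

```python
import os
from glob import glob


def recently_modified(out_dir, since, pattern="*.txt"):
    """Return the paths in out_dir matching pattern that were modified after `since`.

    `since` is a Unix timestamp, e.g. the value of time.time() recorded
    when the script started. Hypothetical helper, for illustration only.
    """
    paths = glob(os.path.join(out_dir, pattern))
    # Keep only files whose last-modification time is later than `since`
    return sorted(p for p in paths if os.path.getmtime(p) > since)
```

In getL0tx this would presumably be called as recently_modified(out_dir, t), with sortLines applied to each returned path. File timestamps can be coarser than time.time() on some filesystems, so subtracting a small safety margin from t may be worth considering.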
For the joining in joinL3, I would suggest this could be implemented in our l3_processor.sh script instead.
echo "Running AWS L3 RAW and TX joiner"
cd /mnt/data/aws/pypromice_aws/aws-l3
echo "Finding all unique AWS data by name..."
names=$(find . -maxdepth 3 -type f -name "*.csv" | cut -d"/" -f3 | sort | uniq)
echo $names
echo "Joining L3 data..."
parallel --will-cite --bar --xapply ' ' "joinL3 -s ./raw/{}/{}_10min.csv -t ./tx/{}/{}_hour.csv -o level_3 -v ../pypromice/src/pypromice/variables.csv -m ../pypromice/src/pypromice/metadata.csv -d raw" ::: $names
Right now, we merely find all unique station names and then join them regardless of time modified. But I think we could solve this simply by specifying a minimum modification timestamp in the name fetching step. Something along these lines:
names=$(find . -maxdepth 3 -type f -name "*.csv" -mmin -60 | cut -d"/" -f3 | sort | uniq)
I think 60 minutes would be enough, right? It could be longer if needed.
We now perform joinL3 only for stations whose files have been modified in the last 2 hours, as follows:
names=$(find . -maxdepth 3 -type f -name "*.csv" -mmin -120 | cut -d"/" -f3 | sort | uniq)
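To make the selection logic explicit, the name-fetching step can be sketched in Python roughly as follows. This is only an illustration, not what l3_processor.sh runs: the function name recent_station_names is hypothetical, and it assumes the <root>/<raw|tx>/<station>/<file>.csv layout implied by the joinL3 call above.

```python
import time
from pathlib import Path


def recent_station_names(root, max_age_minutes, now=None):
    """Return sorted unique station names that have a recently modified CSV.

    Assumes files live at <root>/<source>/<station>/<file>.csv, so the
    station name is the name of the file's parent directory.
    Hypothetical helper, for illustration only.
    """
    now = time.time() if now is None else now
    cutoff = now - max_age_minutes * 60
    names = set()
    for path in Path(root).glob("*/*/*.csv"):
        # Keep the station if this file was modified after the cutoff
        if path.is_file() and path.stat().st_mtime > cutoff:
            names.add(path.parent.name)
    return sorted(names)
```

Calling recent_station_names(".", 120) from the aws-l3 directory would then roughly correspond to the find command with -mmin -120. A station is selected if either its raw or its tx file was recently modified.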
Relevant code section starts with:
As currently implemented, we sort and write back to disk all .txt files. Then, the next section in l3_processor.sh finds that all files have been updated, and sends all files through getL3 (despite the logic in place that identifies only modified files). If a station has not received a new transmission, this results in redundant and unnecessary processing time.
Likewise, joinL3 could be modified to only run for modified raw and/or transmitted files (potentially a separate issue could be opened for this).
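The underlying effect can be reproduced in isolation: writing a file back to disk updates its modification time even when the content is unchanged. The sketch below uses a hypothetical rewrite_sorted function as a stand-in for an unconditional sort-and-write step; it is not the actual sortLines implementation.

```python
import os
import tempfile


def rewrite_sorted(path):
    """Sort the lines of a text file and write them back unconditionally.

    Hypothetical stand-in for a sort-and-write step, for illustration only.
    """
    with open(path) as f:
        lines = f.readlines()
    with open(path, "w") as f:
        f.writelines(sorted(lines))


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "station.txt")
    with open(path, "w") as f:
        f.write("a\nb\n")  # content is already sorted
    # Pretend the file was last written long ago (September 2001)
    os.utime(path, (1_000_000_000, 1_000_000_000))
    before = os.path.getmtime(path)
    rewrite_sorted(path)
    after = os.path.getmtime(path)
    with open(path) as f:
        content = f.read()

# The content is identical, but the file now looks freshly modified
print(content == "a\nb\n", after > before)
```

Any downstream step that selects files by modification time will therefore treat such a file as updated, even though no new data arrived.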