NEFSC / READ-SSB-CHAJI-Effort-Displacement---Scallop

data extracting and data processing is super janky #200

Open mle2718 opened 1 year ago

mle2718 commented 1 year ago

Data extraction and processing are really slow, which is bad for replicability.

  1. There's a query to DMIS (wind_test) that pulls everything. This is too much. Since we are just looking at trips that land scallops, we could pull the rows corresponding to tripids where scallop>=1. There are 4.5M rows in wind_test. But there are only ~360,000 rows that we care about. I think we want this:
    select * from APSD.DMIS_WIND_TEST@garfo_nefsc where docid in (
    select distinct docid FROM APSD.DMIS_WIND_TEST@garfo_nefsc where nespp3=800);
  2. The DMIS query has a similar issue, but it's harder to subset on just trips that landed 800. It also takes much less time to run this query, so maybe we don't have to improve it. This should work:
    SELECT TRIP_ID, DOCID, ACTIVITY_CODE, DAS_ID, LGC_A, LGC_B, LGC_C, SC_2, SC_3, SC_4, SC_5, SC_6, SC_7, SC_8, SC_9, SG_1A, SG_1B
    FROM APSD.t_ssb_trip_current@garfo_nefsc where trip_id in (
    SELECT distinct trip_id 
    FROM APSD.t_ssb_catch_current@garfo_nefsc where nespp3=800);
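
Both queries use the same subsetting logic: find the keys of trips with at least one scallop row (nespp3 = 800), then keep every row for those keys. Here is a minimal base-R sketch of that logic on a made-up data frame. The toy values and the `wind_test`, `scallop_docids`, and `scallop_trips` object names are illustrative assumptions, not real DMIS data or names from the repo.

```r
# Toy stand-in for APSD.DMIS_WIND_TEST; all values are made up for illustration.
wind_test <- data.frame(
  docid  = c(1, 1, 2, 3, 3),
  nespp3 = c(800, 81, 81, 800, 800)
)

# docids of trips with at least one scallop (nespp3 == 800) row;
# this mirrors the inner "select distinct docid ... where nespp3=800"
scallop_docids <- unique(wind_test$docid[wind_test$nespp3 == 800])

# Keep every row for those trips, including their non-scallop rows;
# this mirrors the outer "where docid in (...)"
scallop_trips <- wind_test[wind_test$docid %in% scallop_docids, ]
```

Doing this filter in the database rather than in R is what saves time, since only the matching rows cross the database link.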

mle2718 commented 1 year ago

The copy-over is a little sketchy too. If the "copy-over" option is used, it changes the data vintage, so it no longer matches the date the file was extracted, in order to set things up for the data_processing step. Four files go into the data processing:

Scallop_Linkingorg <- readRDS(here("data","intermediate",paste0("Scallop_Linkingorg_",vintage_string,".Rds")))
RESULT_COMPILED <- readRDS(here("data","intermediate",paste0("RESULT_COMPILED_",vintage_string,".Rds")))
APSD_DMIS_2 <- readRDS(here("data","intermediate",paste0("APSD_DMIS_2_",vintage_string,".Rds")))
all_yrs_costs <- readRDS(here("data","intermediate",paste0("all_yrs_costs_",vintage_string,".Rds")))

The data processing step expects all four to have the same value of vintage_string.

possible solution: modify reset_vintage_string chunk to up all four vintage strings. And then set the output one to either 'today', the earliest, or the latest of the four?
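
One reading of this suggestion, sketched in base R: find the vintages available on disk for each of the four inputs, take the latest per input, and derive the output vintage from those. The helper names (`available_vintages`, `latest_vintages`), the `dir` argument, and the assumption that vintage_string is a sortable date such as YYYY_MM_DD are all hypothetical. The repo's actual vintage format and its reset_vintage_string chunk may differ.

```r
# The four inputs read by the data processing step (prefixes from the readRDS calls).
prefixes <- c("Scallop_Linkingorg", "RESULT_COMPILED", "APSD_DMIS_2", "all_yrs_costs")

# Hypothetical helper: vintage strings found on disk for one file prefix.
# Assumes files are named <prefix>_<vintage_string>.Rds.
available_vintages <- function(prefix, dir) {
  files <- list.files(dir, pattern = paste0("^", prefix, "_.*\\.Rds$"))
  sub(paste0("^", prefix, "_(.*)\\.Rds$"), "\\1", files)
}

# Hypothetical helper: latest vintage per input (NA if no file is found).
# Assumes vintage strings sort chronologically as text, e.g. YYYY_MM_DD.
latest_vintages <- function(dir, prefixes) {
  vapply(prefixes, function(p) {
    v <- available_vintages(p, dir)
    if (length(v) == 0) NA_character_ else max(v)
  }, character(1))
}

# Candidate output vintages: the earliest or the latest of the four inputs.
output_vintage <- function(vintages, rule = c("latest", "earliest")) {
  rule <- match.arg(rule)
  if (rule == "latest") max(vintages) else min(vintages)
}
```

In the project this might be called as `latest_vintages(here("data","intermediate"), prefixes)`, with each input then read using its own vintage string rather than one shared value.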

mchaji commented 1 year ago

I think this was fixed by _data_processingsteps.R or do we need to keep this issue open?

mle2718 commented 1 year ago

@mchaji I'm not sure. Let's leave it open for now.