Cached data format notes

seananderson commented 2 years ago

Some thoughts as I went through the exercise for PBS:

Should we ensure that cached data always have the same column classes?

E.g.

Sometimes that would mean things like this:

haul$pass <- NA_integer_

What should we do in cases where some data type is not relevant? E.g., all our surveys have one pass. Should that be NA or 1?
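One possible way to keep column classes consistent is to coerce every cached table against a small schema before saving it. The sketch below is only an illustration: `haul_schema`, `enforce_schema()`, and the choice of columns are hypothetical and not part of any existing code. Columns that do not apply to a survey are added as typed `NA`, matching the `NA_integer_` idea above.

```r
# Hypothetical schema: column name -> coercion function (assumed for illustration)
haul_schema <- list(
  event_id = as.integer,
  pass     = as.integer,
  effort   = as.numeric
)

# Coerce existing columns to the schema class; add missing columns as typed NA
enforce_schema <- function(df, schema) {
  for (col in names(schema)) {
    if (is.null(df[[col]])) {
      df[[col]] <- schema[[col]](rep(NA, nrow(df)))
    } else {
      df[[col]] <- schema[[col]](df[[col]])
    }
  }
  df
}

# Example input with a character ID and no pass column
haul <- data.frame(
  event_id = c("1", "2"),
  effort = c(1.5, 2),
  stringsAsFactors = FALSE
)
haul <- enforce_schema(haul, haul_schema)
str(haul)
```

Whether a single-pass survey should store `NA` or `1` is a separate decision; the schema only guarantees that the column exists with the agreed class.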

Should trawl_id be event_id so it can expand to longline data eventually?

Similarly, should area_swept be effort and area_swept_units be effort_units?

Are we OK with event/trawl_id not being guaranteed to be unique across surveys? I think that's OK, as long as one always joins on survey and event ID.
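A small sketch of why joining on both survey and event ID matters when IDs can repeat across surveys. The survey labels and values here are made up for illustration.

```r
# Two surveys that happen to reuse event_id 1 (made-up data)
haul <- data.frame(
  survey   = c("survey_a", "survey_b"),
  event_id = c(1L, 1L),
  effort   = c(2.0, 3.5)
)
catch <- data.frame(
  survey       = c("survey_a", "survey_b", "survey_b"),
  event_id     = c(1L, 1L, 1L),
  catch_weight = c(10, 4, 7)
)

# Joining on both keys keeps each catch row matched to its own haul
joined <- merge(catch, haul, by = c("survey", "event_id"))

# Joining on event_id alone cross-matches hauls from different surveys
wrong <- merge(catch, haul, by = "event_id")

nrow(joined)
nrow(wrong)
```

The second join returns more rows than there are catch records, which is the failure mode to guard against if IDs are not globally unique.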

Should scientific_name be itis (just for internally cached data) to ensure cached data have been turned into a standard common currency? As a user, the functions would always add scientific name and common name from a standard table.

Do we want date to be of class date-time or just date? At first I was thinking just date for simplicity and probably to save a bit of file size, but then perhaps time is interesting sometimes ecologically.
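For reference, a date-time can always be reduced to a date later, while the reverse is not possible. A minimal sketch in base R, with an invented timestamp and an assumed UTC time zone:

```r
# Store the full timestamp (value and time zone are assumptions for illustration)
datetime <- as.POSIXct("2019-07-14 13:45:00", tz = "UTC")

# Users who only need the day can drop the time in one step
date_only <- as.Date(datetime, tz = "UTC")

class(datetime)
format(date_only)
```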

Should total_catch_numbers be catch_numbers and should total_catch_wt_kg be catch_weight? Then the haul table could have catch_weight_units... or better yet, we always use kg and skip the extra column? In that case I could go for catch_weight or catch_weight_kg.

ericward-noaa commented 2 years ago

I'd defer to you and AFSC folks on what to do about passes = 1 or NA. I think the same issue would be true with performance? This should either be 0/1 (as integer or factor) or some other classification? Re: typing, I have pass_id as an integer, but could change it.

Re: scientific name, I wasn't sure where to pull ITIS from, but sure -- that's fine with me.

Agree on all the other stuff:

ensure same column classes across datasets.
area_swept -> effort
trawl_id -> event_id
total_catch_numbers -> catch_numbers
total_catch_wt_kg -> catch_weight
Add catch weight units as a column
Add effort units as column
Even though we don't use time, others might -- so I think it's good to include
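The renames agreed above could be applied with a lookup vector. This is a sketch only; `rename_columns()` is a hypothetical helper and the example row is made up.

```r
# Old -> new column names, taken from the list above
renames <- c(
  area_swept          = "effort",
  trawl_id            = "event_id",
  total_catch_numbers = "catch_numbers",
  total_catch_wt_kg   = "catch_weight"
)

# Rename only the columns that are present; leave all others untouched
rename_columns <- function(df, renames) {
  hit <- names(df) %in% names(renames)
  names(df)[hit] <- unname(renames[names(df)[hit]])
  df
}

old <- data.frame(trawl_id = 1L, area_swept = 0.02, total_catch_wt_kg = 12.3)
new <- rename_columns(old, renames)
names(new)
```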

On the trawl_id being unique across regions, I hadn't thought about that. It's unique across the NWFSC surveys -- but those may overlap with BC or GOA surveys. Not a problem if we're joining on survey and event_id -- though another option would be to change the event_id to concatenate survey name and event id, e.g. "SYNQCS_308673"
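For illustration, the concatenated ID could be built as follows. The column name `event_key` and the second event ID are assumptions, not an agreed convention.

```r
# Made-up haul rows; only the first ID appears in the discussion above
haul <- data.frame(
  survey   = c("SYNQCS", "SYNQCS"),
  event_id = c(308673L, 308674L),
  stringsAsFactors = FALSE
)

# Combine survey name and event ID into one globally unique character key
haul$event_key <- paste(haul$survey, haul$event_id, sep = "_")
haul$event_key
```

Keeping the original `survey` and `event_id` columns alongside the key would still allow joins on the numeric ID.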

seananderson commented 2 years ago

Easiest way to look up ITIS is via https://cran.r-project.org/web/packages/taxize/index.html

That only works if there aren't things like "juvenile species A" in data we want.

We probably have something we could call performance code eventually (if we include tows that have been discarded from the official index calculations for various reasons).

I thought about "SYNQCS_308673". One slight downside is that I think a character ID will make the file sizes larger than a numeric one (same for species name vs. ITIS code?). It would also make it harder to join these datasets onto our internal datasets, which is probably the best reason not to.

Lewis-Barnett-NOAA commented 2 years ago

Sorry it took a while to get to this. I agree with all these suggestions and tried to incorporate them into my afsc code.

A few things I'll need to follow up on, as I tried this for the first time with our new API, which is amazing and makes things super simple, such that anyone can run the code with no special access permissions. However, as a result, there are a few tiny things that I/Emily Markowitz will need to change on the API side to jibe with the rest of our conventions:

Haul is currently unique only within each cruise; will change to a unique code.
ITIS (and WoRMS) is already linked on the data side, but not in the public API, so for now this is the scientific name.
Only start lat/lon was available by API, so I left end coords as NA.
Haul performance left as NA because the public API only pulls "satisfactory" or better hauls... which may be a good filter, to be honest, if we let this loose on the public.

Lastly, a note on ITIS/taxon identification. In our database we have a qualitative (low, med, high) rating for our confidence in the species identification. Maybe we should include something like this? It seems probably impossible to standardize across surveys though?

Lewis-Barnett-NOAA commented 2 years ago

Ah, and the other thing is that the data served by the API already have joined the haul and catch data, which we were probably avoiding initially due to file sizes? I can separate if needed.
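If separation is wanted, a pre-joined table can be split back into haul and catch tables by selecting the haul-level columns and dropping duplicate rows. A sketch with made-up data; the column sets are assumptions about which fields are haul-level.

```r
# Made-up joined table: haul-level columns repeat for every catch row
joined <- data.frame(
  survey          = c("survey_a", "survey_a", "survey_a"),
  event_id        = c(1L, 1L, 2L),
  effort          = c(2.0, 2.0, 3.5),
  scientific_name = c("species A", "species B", "species A"),
  catch_weight    = c(10, 4, 7),
  stringsAsFactors = FALSE
)

# Assumed split of columns between the two tables
haul_cols  <- c("survey", "event_id", "effort")
catch_cols <- c("survey", "event_id", "scientific_name", "catch_weight")

haul  <- unique(joined[, haul_cols])   # one row per haul
catch <- joined[, catch_cols]          # one row per species per haul

# Re-joining on the shared keys recovers the original number of rows
rejoined <- merge(catch, haul, by = c("survey", "event_id"))
nrow(rejoined)
```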
