Open seananderson opened 2 years ago
I'd defer to you and AFSC folks on what to do about passes = 1 or NA. I think the same issue applies to performance: should it be 0/1 (as integer or factor) or some other classification? Re: typing, I have pass_id as an integer, but could change it.
Re: scientific name, I wasn't sure where to pull itis from, but sure -- that's fine with me
Agree on all the other stuff:
On the trawl_id being unique across regions, I hadn't thought about that. It's unique across the NWFSC surveys -- but those may overlap with BC or GOA surveys. Not a problem if we're joining on survey and event_id -- though another option would be to change the event_id to concatenate survey name and event id, e.g. "SYNQCS_308673"
Easiest way to look up ITIS is via https://cran.r-project.org/web/packages/taxize/index.html
That only works if there aren't things like "juvenile species A" in data we want.
We probably have something we could call performance code eventually (if we include tows that have been discarded from the official index calculations for various reasons).
I thought about "SYNQCS_308673". One slight downside is that I think it will make the file sizes larger than a numeric ID would (same for species name vs. ITIS code?). Plus, it would make it harder to use these datasets and then join them onto our internal datasets. That's probably the best reason not to.
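To make the two options concrete, here is a minimal sketch (in Python, purely for illustration) of joining on the survey and event ID pair versus building a concatenated ID. The records, the depth_m column, and the itis values are made up; only the survey, event_id, and "SYNQCS_308673" naming come from the discussion above.

```python
# Hypothetical records: the same numeric event_id appears in two surveys.
hauls = [
    {"survey": "SYNQCS", "event_id": 308673, "depth_m": 120.0},
    {"survey": "NWFSC", "event_id": 308673, "depth_m": 85.0},
]
catch = [
    {"survey": "SYNQCS", "event_id": 308673, "itis": 1, "catch_weight": 12.5},
    {"survey": "NWFSC", "event_id": 308673, "itis": 1, "catch_weight": 3.0},
]


def join_on_survey_and_event(hauls, catch):
    """Attach haul columns to each catch row, keyed on (survey, event_id)."""
    haul_index = {(h["survey"], h["event_id"]): h for h in hauls}
    joined = []
    for row in catch:
        haul = haul_index.get((row["survey"], row["event_id"]))
        if haul is not None:
            joined.append({**haul, **row})
    return joined


def concatenated_id(survey, event_id):
    """Alternative: a single string key such as 'SYNQCS_308673'."""
    return f"{survey}_{event_id}"


joined = join_on_survey_and_event(hauls, catch)
```

With the two-column key, event_id stays numeric and can still be matched against internal datasets. The concatenated form is unique on its own but turns the ID into a string.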
Sorry it took a while to get to this. I agree with all these suggestions and tried to incorporate them into my afsc code.
There are a few things I'll need to follow up on, as I tried this for the first time with our new API, which is amazing and makes things super simple, such that anyone can run the code with no special access permissions. However, as a result, there are a few tiny things that I/Emily Markowitz will need to change on the API side to jibe with the rest of our conventions:
Lastly, a note on ITIS/taxon identification. In our database we have a qualitative (low, med, high) rating for our confidence in the species identification. Maybe we should include something like this? It seems probably impossible to standardize across surveys, though.
Ah, and the other thing is that the data served by the API already have joined the haul and catch data, which we were probably avoiding initially due to file sizes? I can separate if needed.
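If separating turns out to be needed, a rough sketch of splitting pre-joined rows back into a haul table and a catch table could look like the following (Python, for illustration only). The column lists are assumptions based on names proposed in this thread, not the actual API fields, and the values are made up.

```python
# Assumed column split; the real API fields may differ.
HAUL_COLUMNS = ("survey", "event_id", "date", "effort")
CATCH_COLUMNS = ("survey", "event_id", "itis", "catch_weight")


def split_joined(rows):
    """Split joined rows into one row per haul plus one row per catch record."""
    hauls = {}
    catch = []
    for row in rows:
        key = (row["survey"], row["event_id"])
        # Keep haul-level columns only once per (survey, event_id).
        hauls.setdefault(key, {col: row[col] for col in HAUL_COLUMNS})
        catch.append({col: row[col] for col in CATCH_COLUMNS})
    return list(hauls.values()), catch


# Made-up example: two species caught in one haul, one species in another.
joined_rows = [
    {"survey": "GOA", "event_id": 1, "date": "2021-06-01", "effort": 2.5,
     "itis": 1, "catch_weight": 10.0},
    {"survey": "GOA", "event_id": 1, "date": "2021-06-01", "effort": 2.5,
     "itis": 2, "catch_weight": 4.0},
    {"survey": "GOA", "event_id": 2, "date": "2021-06-02", "effort": 2.7,
     "itis": 1, "catch_weight": 6.5},
]
haul_table, catch_table = split_joined(joined_rows)
```

The haul-level columns are stored once per haul rather than repeated on every catch row, which is where the file-size saving would come from.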
Some thoughts as I went through the exercise for PBS:
Should we ensure that cached data always have the same column classes?
E.g.
Sometimes that would mean things like this:
What should we do in cases where some data type is not relevant? E.g. all our surveys have one pass. Should that be NA or 1?
Should trawl_id be event_id so it can expand to longline data eventually? Similarly, should area_swept be effort and area_swept_units be effort_units?
Are we OK with event/trawl_id not being guaranteed to be unique across surveys? I think that's OK, as long as one always joins on survey and event ID.
Should scientific_name be itis (just for internally cached data) to ensure cached data have been turned into a standard common currency? As a user, the functions would always add scientific name and common name from a standard table.
Do we want date to be of class date-time or just date? At first I was thinking just date for simplicity and probably to save a bit of file size, but then perhaps time is interesting sometimes ecologically.
Should total_catch_numbers be catch_numbers and should total_catch_wt_kg be catch_weight? Then the haul table could have catch_weight_units ... or best yet, we always use kg and skip the extra column? In that case I could go for catch_weight or catch_weight_kg.
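As one way to picture the "same column classes" idea from the questions above, here is a small sketch (Python, for illustration only) that coerces every cached row to a fixed schema. The schema itself is an assumption: the column names are the ones floated in this thread, the types are guesses, and missing values are kept as None (the analogue of NA) rather than filled with a default such as 1.

```python
from datetime import date

def parse_date(value):
    """Accept an ISO date string or an existing date object."""
    return value if isinstance(value, date) else date.fromisoformat(value)


# Hypothetical schema: column name -> converter. Not an agreed standard.
SCHEMA = {
    "survey": str,
    "event_id": int,
    "pass_id": int,
    "date": parse_date,
    "catch_weight": float,  # assumed to always be in kg
}


def coerce_row(row, schema=SCHEMA):
    """Return a row with exactly the schema's columns and types."""
    out = {}
    for column, convert in schema.items():
        value = row.get(column)
        # Missing or not-applicable values stay None instead of being guessed.
        out[column] = None if value is None else convert(value)
    return out


# Made-up raw row, as it might arrive from one region's export.
raw = {"survey": "SYNQCS", "event_id": "308673", "pass_id": None,
       "date": "2020-07-01", "catch_weight": "12.5"}
clean = coerce_row(raw)
```

Running every region's data through one such step before caching would mean columns like event_id and pass_id always come back with the same class, whichever survey they came from.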