Open amael-ls opened 8 years ago
I ran into this with pin ban and a few other spp. Looks like some records just didn't get TSNs associated with them. It's probably safe to just update the NAs to the correct species, no?
I agree, updating is the best idea.
Miranda
(In reply to Matthew Talluto's comment of Sep 12, 2016, 1:30 PM.)
The potentially problematic species: "NA-ACA-ANE" "NA-CAR-CAR" "NA-CAR-OVA" "NA-CHA-NA" "NA-CYA-NA" "NA-EUG-PAL" "NA-EUG-STE" "NA-HED-NA" "NA-LIQ-STY" "NA-MAL-NA" "NA-MOR-NA" "NA-PAR-NA" "NA-PIN-BAN" "NA-PLA-NA" "NA-PRI-LAN" "NA-PSY-MAR" "NA-PTE-MAC" "NA-QUE-MAR" "NA-QUE-PRI"
Here is the small function I used to detect them (it is a quick and dirty solution, sorry): listProblem.R.zip
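The attached listProblem.R is not reproduced in this thread. As a rough illustration only (in Python, and not the attached R code), one way to flag such records is to look for species codes whose first component, the TSN, is the literal string NA. The function name `find_missing_tsn` and the sample list are assumptions made for this sketch.

```python
def find_missing_tsn(codes):
    """Return the species codes whose TSN component (before the first '-') is 'NA'."""
    return [code for code in codes if code.split("-", 1)[0] == "NA"]


# Sample input: three codes that appear in this thread.
codes = ["NA-PIN-BAN", "18032-ABI-BAL", "NA-CHA-NA"]
missing = find_missing_tsn(codes)
print(missing)
```

Splitting on the first hyphen only, rather than matching "NA" anywhere in the string, avoids flagging a code just because its epithet part is NA.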
OK, after comparing Latin names, I found that only two species are the same:
Therefore they can be merged.
I think there are a few issues going on here. For Pin ban and Liq Sty, there are TSNs for some records and not for others, so the NA records need to be updated to point to the right species key. For others, TSNs (and in some cases, specific epithets) are missing entirely.
For the missing epithets (records ending in -NA), we should verify from the raw data if possible that these records were only genus level observations.
For others, we should add TSNs when they are available. If the species is not listed in ITIS, we should check for synonyms and use the TSN for the synonym.
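For the first case (a TSN present on some records of a species but not on others), the update could look roughly like the sketch below. It is written in Python for illustration and assumes the species code has the form TSN-GENUS-EPITHET, as in the codes quoted in this thread. The function name `fill_missing_tsn` is hypothetical, and the record "NA-ABI-BAL" is made up for the example. Codes with no matching record, such as genus-level entries ending in -NA, are left untouched so they can be checked by hand.

```python
def fill_missing_tsn(codes):
    """Replace a leading 'NA' with the TSN of another record that shares the
    same genus-epithet suffix; leave the code unchanged if none exists."""
    # Map each genus-epithet suffix to a TSN seen on some record.
    known = {}
    for code in codes:
        tsn, suffix = code.split("-", 1)
        if tsn != "NA":
            known[suffix] = tsn

    fixed = []
    for code in codes:
        tsn, suffix = code.split("-", 1)
        if tsn == "NA" and suffix in known:
            fixed.append(known[suffix] + "-" + suffix)
        else:
            fixed.append(code)
    return fixed


# "18032-ABI-BAL" appears in this thread; "NA-ABI-BAL" is a made-up record.
codes = ["18032-ABI-BAL", "NA-ABI-BAL", "NA-CHA-NA"]
updated = fill_missing_tsn(codes)
print(updated)
```

This only covers records that already have a sibling with a TSN. Species missing from ITIS under their current name would still need the synonym lookup described above.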
Another issue (which might not be one...): there are semicolons in the English names of some species. read.table (and friends) in R therefore cannot read them, because the separator is also a semicolon. Here is a C++ function that detects where the problems are. In the file "final_ref_table.csv", I found 78 problems (run the function to get the lines). Example, line 11:
18032;"Abies";"balsamea";"Balsam fir ;balsam fir";"Sapin baumier";"SAB";20;"Bf";12;5;"18032-ABI-BAL"
read.table handles this fine on my machine. The quotes protect the extra semicolon. Depending on your version/localization of R, you may have to set sep=";", quote='"' explicitly.
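To illustrate why the quotes matter, here is a small sketch using Python's standard csv module (not R's read.table, whose behaviour is only described above). It parses the example line quoted earlier in the thread twice: once with a naive split on semicolons, and once with a quote-aware reader.

```python
import csv
import io

# Line 11 of final_ref_table.csv, as quoted in this thread.
line = ('18032;"Abies";"balsamea";"Balsam fir ;balsam fir";'
        '"Sapin baumier";"SAB";20;"Bf";12;5;"18032-ABI-BAL"')

# Naive split: the semicolon inside the quoted English name creates an extra field.
naive_fields = line.split(";")

# Quote-aware parsing: the quoted field is kept whole.
quoted_fields = next(csv.reader(io.StringIO(line), delimiter=";", quotechar='"'))

print(len(naive_fields), len(quoted_fields))
print(quoted_fields[3])
```

The naive split yields 12 fields and the quote-aware reader yields 11, which is consistent with a detector that ignores quotes reporting a problem on this line while a quote-aware reader parses it correctly.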
From @mtalluto: "For the missing epithets (records ending in -NA), we should verify from the raw data if possible that these records were only genus level observations."
Yes, you're right. This is the decision we took. Those species have only a genus.
As you suggested, I have to update the first NA value in the species code string to the right TSN (when possible). We still have to keep in mind that, of the ~2500 total species in the ref_species table, only ~200 are present in the QUICC-FOR database.
It seems that some species have synonyms; maybe this is why you could not find the TSN code. Examples:
NA-CAR-ALB; Carya alba; Carya tomentosa
NA-CHA-NOO; Chamaecyparis nootkatensis; Cupressus nootkatensis (changed in 1993)
NA-QUE-PRI; Quercus prinus L.; Quercus montana
NA-TAX-ASC; Taxodium ascendens; Taxodium distichum var. imbricarium (or var. nutans??)
cf ITIS website: http://www.itis.gov/servlet/SingleRpt/SingleRpt?search_topic=TSN&search_value=183433
Yes, this is exactly it. I don't have access to the database from here (I think?), so I can't make the change. You'll have to buy Steve a beer and he can do it :)
In file final_ref_table.csv, 2 names for Pinus banksiana (cf lines 1150 and 2276):