Open tucotuco opened 6 years ago
Thanks John - I’ll reach out to iDigBio again and see if ALA will join us now.
On Jun 4, 2018, at 2:04 PM, John Wieczorek wrote:
Our distinct value lists from 2017 are more than a year old now. We intended to try to make annual copies of these, so any time now will be good to gather these again.
John can do this for VertNet and request it of GBIF.
Thanks Deb: I have notified our lead ALA developer Nick dos Remedios about this.
I have made the request to Tim Robertson at GBIF who put them together for us last time. I'll work on the VertNet values.
I'm happy to generate such lists from the ALA. Is there a set of DwC fields (and/or non-DwC fields) that I can use to query with?
Thanks, Nick. We have been trying to gather distinct value lists for terms (with count) for Occurrences that might benefit from controlled vocabularies. Here is a list of what others have been summarizing:
basisOfRecord continent countrycode country day disposition establishmentMeans geodeticDatum georeferenceVerificationStatus identificationQualifier identificationVerificationStatus islandGroup island language license lifeStage month nomenclaturalCode occurrenceStatus organismScope preparations reproductiveCondition sex taxonRank taxonomicStatus typeStatus type verbatimSRS waterbody
It looks like iDigBio also added some indexed versions of terms for comparisons of interest (https://github.com/tdwg/dwc-qa/tree/master/data/idigbioDistinctValues).
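For anyone new to the exercise, here is a minimal sketch of what a distinct value list with counts amounts to for one term. It is only an illustration, not the method VertNet, GBIF, or iDigBio actually use: the sample records and the function names `distinct_values` and `to_csv` are made up, and the output simply uses a header of the term name plus a "reps" column.

```python
import csv
import io
from collections import Counter


def distinct_values(records, term):
    """Count how many records carry each distinct value of `term`."""
    # Records lacking the term are counted under the empty string so gaps stay visible.
    return Counter(rec.get(term, "") for rec in records)


def to_csv(counts, term):
    """Render the counts as CSV text with a header of the term name and 'reps'."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow([term, "reps"])
    # Most frequent values first; ties are broken alphabetically for stable output.
    for value, reps in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
        writer.writerow([value, reps])
    return out.getvalue()


# Hypothetical sample records, for illustration only.
sample = [
    {"basisOfRecord": "PreservedSpecimen"},
    {"basisOfRecord": "PreservedSpecimen"},
    {"basisOfRecord": "preserved specimen"},
    {"basisOfRecord": "HumanObservation"},
]

counts = distinct_values(sample, "basisOfRecord")
csv_text = to_csv(counts, "basisOfRecord")
print(csv_text)
```

Values are deliberately not normalized (no trimming or case folding), since the point of these lists is to see the variants that a controlled vocabulary would have to reconcile.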
And here is an example CSV from last year from VertNet for basisOfRecord, with a header that includes the DwC term name and "reps" as the number of Occurrences the value appeared in:
https://github.com/tdwg/dwc-qa/blob/master/data/VNDistinctValues/VertNet_distinct_basisOfRecord_2017-02-14.csv
Yes please Nick, if you are also willing/able to share the indexed versions - that would be great. These are super useful for helping people to understand indexing...(and more).
Excited to have you on board. Thank you.
It might also be interesting for all of us to add distinct values for the year term.
That would be most useful, instructive, and entertaining.
Deb
VertNet distinct values added in commit https://github.com/tdwg/dwc-qa/commit/449824b992c74a351e94b3f4d4b6330fb5711e86.
I've managed to pull out unique values for a subset of fields from the ALA SOLR index. We don't index all fields, so the missing fields could perhaps be generated from Cassandra (I don't know how to do that). I figured this subset would be a good start, and our next major release should include all DwC fields (we're moving to a clustered architecture to handle the larger data volume).
Should I attach the TXT file to this issue, or commit it to a directory here or in another repo? I noticed the comment above references a commit that is not in this repo, so I wanted to check first.
Edit: ZIP file with shell script and output from script
fields used: basis_of_record country_code country month year establishment_means raw_identification_qualifier license occurrence_status_s reproductive_condition_s raw_sex rank type_status
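For readers who want to try something similar against their own index, below is a rough sketch of how distinct values with counts can be requested from Solr using field faceting. It is not the shell script from the ZIP file above. The endpoint URL and core name are hypothetical, the response is hand-written rather than real ALA data, and only standard Solr faceting parameters (`facet`, `facet.field`, `facet.limit`, `facet.mincount`) are used.

```python
from urllib.parse import urlencode

# Hypothetical endpoint: the real ALA Solr URL and core name are not given in this thread.
SOLR_SELECT_URL = "http://localhost:8983/solr/biocache/select"


def facet_query_url(field, base_url=SOLR_SELECT_URL):
    """Build a Solr request returning every distinct value of `field` with its count."""
    params = {
        "q": "*:*",           # match all records
        "rows": 0,            # return no documents, only facet counts
        "wt": "json",
        "facet": "true",
        "facet.field": field,
        "facet.limit": -1,    # no cap on the number of distinct values
        "facet.mincount": 1,  # skip values that occur in no record
    }
    return base_url + "?" + urlencode(params)


def parse_facet_counts(response, field):
    """Turn Solr's flat [value, count, value, count, ...] list into (value, count) pairs."""
    flat = response["facet_counts"]["facet_fields"][field]
    return list(zip(flat[0::2], flat[1::2]))


# Hand-written response in Solr's default JSON facet layout, for illustration only.
fake_response = {
    "facet_counts": {
        "facet_fields": {
            "basis_of_record": ["PreservedSpecimen", 120, "HumanObservation", 45]
        }
    }
}

url = facet_query_url("basis_of_record")
pairs = parse_facet_counts(fake_response, "basis_of_record")
print(url)
print(pairs)
```

Faceting only works on fields that are indexed, which is consistent with the note above that non-indexed fields would have to come from another source.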
Hi Nick, That's great. If you clone or fork the tdwg/dwc-qa repository, create a new branch, add a folder for ALA, add the files to that folder, commit, push and make a pull request, that would be ideal.
Hi @tucotuco, I've created another PR with some changes, including the suggested readme file, using sub-directories with date, as well as indicating "index" values in the file name, similar to how iDigBio does it.