VertNet / dwc-qa-manage

Repository to handle the management of the tdwg/dwc-qa input, keeping it separate from the input itself to shield subscribers from irrelevant issues.
Apache License 2.0

Gather 2018 Distinct Value lists #39

Open tucotuco opened 6 years ago

tucotuco commented 6 years ago

Our distinct value lists from 2017 are more than a year old now. We intended to try to make annual copies of these, so any time now will be good to gather these again.

John can do this for VertNet and request it of GBIF.

debpaul commented 6 years ago

Thanks John - I’ll reach out to iDigBio again and see if ALA will join us now.

Tasilee commented 6 years ago

Thanks Deb: I have notified our lead ALA developer Nick Dosremedios about this.

tucotuco commented 6 years ago

I have made the request to Tim Robertson at GBIF who put them together for us last time. I'll work on the VertNet values.

nickdos commented 6 years ago

I'm happy to generate such lists from the ALA. Is there a set of DwC fields (and/or non-DwC fields) that I can use to query with?

tucotuco commented 6 years ago

Thanks, Nick. We have been trying to gather distinct value lists for terms (with count) for Occurrences that might benefit from controlled vocabularies. Here is a list of what others have been summarizing:

basisOfRecord, continent, countrycode, country, day, disposition, establishmentMeans, geodeticDatum, georeferenceVerificationStatus, identificationQualifier, identificationVerificationStatus, islandGroup, island, language, license, lifeStage, month, nomenclaturalCode, occurrenceStatus, organismScope, preparations, reproductiveCondition, sex, taxonRank, taxonomicStatus, typeStatus, type, verbatimSRS, waterbody

It looks like iDigBio also added some indexed versions of terms for comparisons of interest (https://github.com/tdwg/dwc-qa/tree/master/data/idigbioDistinctValues).

And here is an example CSV from last year from VertNet for basisOfRecord. The header includes the DwC term name and "reps", the number of Occurrences in which each value appeared:

https://github.com/tdwg/dwc-qa/blob/master/data/VNDistinctValues/VertNet_distinct_basisOfRecord_2017-02-14.csv
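For anyone producing these lists from a flat export, here is a minimal Python sketch of the idea, not the script VertNet actually used. It counts the distinct values of one term and writes them with a header of the term name and "reps", as described above. The function names, the sample records, and the sort order are assumptions made for illustration.

```python
import csv
import io
from collections import Counter


def distinct_values(rows, term):
    """Count how many occurrence records carry each distinct value of `term`."""
    # Missing or empty values are counted under the empty string so they stay visible.
    return Counter((row.get(term) or "") for row in rows)


def write_distinct_csv(counts, term, out):
    """Write value/count pairs with a header of the term name and "reps"."""
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow([term, "reps"])
    # Most frequent values first; ties broken alphabetically for stable output.
    for value, reps in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
        writer.writerow([value, reps])


# Hypothetical sample records standing in for a real occurrence export.
sample = io.StringIO(
    "occurrenceID,basisOfRecord\n"
    "1,PreservedSpecimen\n"
    "2,preservedspecimen\n"
    "3,PreservedSpecimen\n"
    "4,\n"
)
counts = distinct_values(csv.DictReader(sample), "basisOfRecord")
out = io.StringIO()
write_distinct_csv(counts, "basisOfRecord", out)
print(out.getvalue())
```

Values are deliberately not trimmed or case-folded here, since the variants ("PreservedSpecimen" versus "preservedspecimen") are exactly what a controlled vocabulary would need to reconcile.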

debpaul commented 6 years ago

Yes please, Nick. If you are also willing and able to share the indexed versions, that would be great. These are super useful for helping people to understand indexing (and more).

Excited to have you on board. Thank you.

tucotuco commented 6 years ago

It might also be interesting for all of us to add distinct values for the year term.

debpaul commented 6 years ago

That would be most useful, instructive, and entertaining.

Deb

tucotuco commented 6 years ago

VertNet distinct values added in commit https://github.com/tdwg/dwc-qa/commit/449824b992c74a351e94b3f4d4b6330fb5711e86.

nickdos commented 6 years ago

I've managed to pull out unique values for a subset of fields from the ALA SOLR index. We don't index all fields, so the missing ones might have to be generated from Cassandra (I don't know how to do that). I figured this subset would be a good start, and our next major release should include all DwC fields (we're moving to a clustered architecture to handle the bigger data).

Should I attach the TXT file to this issue, or commit it to a directory here or in another repo? I noticed the comment above references a commit that is not in this repo, so I wanted to check first.

Edit: ZIP file with shell script and output from script

Fields used: basis_of_record, country_code, country, month, year, establishment_means, raw_identification_qualifier, license, occurrence_status_s, reproductive_condition_s, raw_sex, rank, type_status
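For readers who have not pulled distinct values from a SOLR index before, one common approach is a facet query. The Python sketch below is an illustration under assumptions, not the shell script in the ZIP file: the endpoint URL and core name are hypothetical, and the sample response is hand-written in Solr's default JSON layout, where each facet field is a flat list of alternating values and counts.

```python
from urllib.parse import urlencode


def facet_query_url(select_url, field):
    """Build a Solr select URL asking only for facet counts on `field`."""
    params = {
        "q": "*:*",            # match every indexed record
        "rows": 0,             # no documents needed, only the facet counts
        "wt": "json",
        "facet": "true",
        "facet.field": field,
        "facet.limit": -1,     # return every distinct value, not just the top ones
        "facet.mincount": 1,   # skip values that occur in no records
    }
    return select_url + "?" + urlencode(params)


def parse_facet_counts(response, field):
    """Turn Solr's flat [value, count, value, count, ...] list into pairs."""
    flat = response["facet_counts"]["facet_fields"][field]
    return list(zip(flat[0::2], flat[1::2]))


# Hypothetical endpoint and a hand-written response in Solr's default JSON shape.
url = facet_query_url("https://solr.example.org/solr/biocache/select", "basis_of_record")
sample_response = {
    "facet_counts": {
        "facet_fields": {
            "basis_of_record": ["PreservedSpecimen", 120, "HumanObservation", 45]
        }
    }
}
pairs = parse_facet_counts(sample_response, "basis_of_record")
print(url)
print(pairs)
```

Only fields that are actually indexed can be faceted this way, which matches the limitation noted above for the fields missing from the ALA index.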

tucotuco commented 6 years ago

Hi Nick, That's great. If you clone or fork the tdwg/dwc-qa repository, create a new branch, add a folder for ALA, add the files to that folder, commit, push and make a pull request, that would be ideal.

nickdos commented 6 years ago

Hi @tucotuco, I've created another PR with some changes, including the suggested readme file, using sub-directories with date, as well as indicating "index" values in the file name, similar to how iDigBio does it.