NIAID-Data-Ecosystem / nde-crawlers

Harvesting infrastructure to collect and standardize dataset and computational tool metadata
Apache License 2.0

[Source]: VEuPathDB collections #135

Open gtsueng opened 2 months ago

gtsueng commented 2 months ago

Source Name

VEuPathDB collections

Source URL

see description

Source Description

VEuPathDB hosts numerous collections which are treated as sources in their own right. This includes GiardiaDB, CryptoDB, etc.

These collections are as follows:

| Collection | URL |
| --- | --- |
| AmoebaDB | https://amoebadb.org/amoeba/app |
| CryptoDB | https://cryptodb.org/cryptodb/app |
| GiardiaDB | https://giardiadb.org/giardiadb/app |
| HostDB | https://hostdb.org/hostdb/app |
| PlasmoDB | https://plasmodb.org/plasmo/app |
| VectorBase | https://vectorbase.org/vectorbase/app |
| FungiDB | https://fungidb.org/fungidb/app |
| MicrosporidiaDB | https://microsporidiadb.org/micro/app |
| ToxoDB | https://toxodb.org/toxo/app |
| TrichDB | https://trichdb.org/trichdb/app |
| TriTrypDB | https://tritrypdb.org/tritrypdb/app |
| PiroplasmaDB | https://piroplasmadb.org/piro/app |

The URL structure for a record in these databases is similar to that of VEuPathDB:

Structure: https://{base_url}/record/dataset/{identifier}
Example: https://amoebadb.org/amoeba/app/record/dataset/DS_63733e001b
Identical record in VEuPathDB: https://veupathdb.org/veupathdb/app/record/dataset/DS_63733e001b

As seen above, the record ID is identical between the resources.
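
To make the URL pattern concrete, here is a minimal Python sketch. The base URLs are copied from the collection list above; the names `COLLECTION_BASE_URLS`, `record_url`, and `candidate_urls` are hypothetical and not part of the nde-crawlers codebase. Note that the base URLs here already include the `https://` scheme. The sketch only builds the URL a record would have in each collection; it does not check whether the record actually exists there.

```python
# Base URLs copied from the collection list in this issue.
COLLECTION_BASE_URLS = {
    "AmoebaDB": "https://amoebadb.org/amoeba/app",
    "CryptoDB": "https://cryptodb.org/cryptodb/app",
    "GiardiaDB": "https://giardiadb.org/giardiadb/app",
    "HostDB": "https://hostdb.org/hostdb/app",
    "PlasmoDB": "https://plasmodb.org/plasmo/app",
    "VectorBase": "https://vectorbase.org/vectorbase/app",
    "FungiDB": "https://fungidb.org/fungidb/app",
    "MicrosporidiaDB": "https://microsporidiadb.org/micro/app",
    "ToxoDB": "https://toxodb.org/toxo/app",
    "TrichDB": "https://trichdb.org/trichdb/app",
    "TriTrypDB": "https://tritrypdb.org/tritrypdb/app",
    "PiroplasmaDB": "https://piroplasmadb.org/piro/app",
}
VEUPATHDB_BASE_URL = "https://veupathdb.org/veupathdb/app"


def record_url(base_url: str, identifier: str) -> str:
    """Join a base URL (scheme included) and a dataset identifier."""
    return f"{base_url.rstrip('/')}/record/dataset/{identifier}"


def candidate_urls(identifier: str) -> dict[str, str]:
    """Map each collection name to the URL the record would have there."""
    return {
        name: record_url(base, identifier)
        for name, base in COLLECTION_BASE_URLS.items()
    }


print(record_url(COLLECTION_BASE_URLS["AmoebaDB"], "DS_63733e001b"))
```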

Desired outcome: We would like to add an includedInDataCatalog value for each of the records hosted in these "databases". Currently, they are likely to be ingested via VEuPathDB and have an includedInDataCatalog.name value of veupathdb. We would like it to have the includedInDataCatalog values of [{name: veupathdb, etc.},{name:otherdb, etc}] for whichever DB the record is also a part of.
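
As a rough illustration of the desired outcome, the sketch below appends a collection entry to a record's `includedInDataCatalog` without duplicating one that is already present. The function name `add_catalog` and the record shape are assumptions made for this example, not the actual nde-crawlers implementation; it assumes the existing value may be either a single object or a list.

```python
def add_catalog(record: dict, catalog: dict) -> dict:
    """Return a copy of `record` whose includedInDataCatalog is a list
    that contains `catalog`, matched by name to avoid duplicates."""
    existing = record.get("includedInDataCatalog", [])
    # Assumption: a record ingested from one source may hold a single
    # object rather than a list, so normalize to a list first.
    if isinstance(existing, dict):
        existing = [existing]
    merged = list(existing)
    known_names = {entry.get("name") for entry in merged}
    if catalog.get("name") not in known_names:
        merged.append(catalog)
    return {**record, "includedInDataCatalog": merged}


# Illustrative record; the field values are placeholders.
record = {
    "identifier": "DS_63733e001b",
    "includedInDataCatalog": {"@type": "DataCatalog", "name": "veupathdb"},
}
updated = add_catalog(record, {"@type": "DataCatalog", "name": "AmoebaDB"})
print([entry["name"] for entry in updated["includedInDataCatalog"]])
```

Matching on `name` keeps the operation idempotent, so re-running a crawl would not add the same collection twice.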

To do:

Caveats:

Source Access

No access issue, account not needed

Source Funding

NIAID

Source Relevance

NIAID-funded

Related WBS task

For internal use only. Assignee, please select the status of this issue

Status Description

Please hold on starting this issue until the NIAID team confirms the desired outcome. See https://github.com/NIAID-Data-Ecosystem/niaid-feedback/issues/127

Source to-do list

gtsueng commented 1 month ago

As seen in https://github.com/NIAID-Data-Ecosystem/niaid-feedback/issues/127, this issue has been approved for work start

gtsueng commented 1 month ago

@jal347 I've assigned this issue to you. Please let me know if you have any questions, after you've had a look at it.

If you need an example, consider AmoebaDB https://amoebadb.org/amoeba/app. Each record in AmoebaDB should already be in the NIAID Data ecosystem because each record is also in VEuPathDB (which is already included). What we want to do is identify all the records in AmoebaDB and add to each record:

"includedInDataCatalog" : { "@type": "DataCatalog", "name": "AmoebaDB",
"url": "https://amoebadb.org/amoeba/app/record/dataset/identifier", "versionDate": "YYYY-MM-DD" }

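For reference, an entry of the shape shown above could be generated along these lines. This is a sketch only: `catalog_entry` is a hypothetical helper, and the date passed in is a placeholder, since where `versionDate` should come from is not specified in this issue.

```python
from datetime import date


def catalog_entry(name: str, base_url: str, identifier: str,
                  version_date: date) -> dict:
    """Build one includedInDataCatalog object for a collection record."""
    return {
        "@type": "DataCatalog",
        "name": name,
        # Follows the {base_url}/record/dataset/{identifier} pattern.
        "url": f"{base_url}/record/dataset/{identifier}",
        # ISO 8601 date, i.e. YYYY-MM-DD.
        "versionDate": version_date.isoformat(),
    }


entry = catalog_entry(
    "AmoebaDB",
    "https://amoebadb.org/amoeba/app",
    "DS_63733e001b",
    date(2024, 6, 11),  # placeholder date for illustration
)
print(entry)
```
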
gtsueng commented 2 weeks ago

@hartwickma, @lisa-mml, @rshabman, @sudvenk

As mentioned at the bi-weekly meeting dated 2024.06.11, the VEuPathDB collections are now available on staging and can be accessed by going to the staging site and selecting the 'source' filter.

If you have any feedback or suggestions on how collections like the ones coming from VEuPathDB should be displayed, please provide them to the collections issue here: https://github.com/NIAID-Data-Ecosystem/niaid-feedback/issues/136

gtsueng commented 6 days ago

Per the discussion at the bi-weekly meeting dated 2024.06.25, VEuPathDB on Staging has not yet been moved to Production because of the metadata changes caused by merging in the VEuPathDB collections data.

@DylanWelzel @jal347 is it possible to update VEuPathDB on Production without the data merged in from VEuPathDB collections? If so, please proceed to do so. If not, then we will wait until VEuPathDB collections are approved before proceeding.

hartwickma commented 5 days ago

Hi @gtsueng, thanks for following up on this item to update the VEuPathDB to the most recent release. NIAID approves the VEuPathDB collections for production, and this will hopefully remove the metadata merge complication.

There is also an issue with the current display of VEuPathDB in the filter function that needs to be updated as well. The filter options under 'Resources' need to reflect that these VEuPathDB collections are part of VEuPathDB. Please:

  1. Move the VEuPathDB collections from where they are listed under 'Other Resources' so that they appear with VEuPathDB under 'IID Resources'
  2. We would like the appearance of these VEuPathDB collections to indicate that they are part of the larger VEuPathDB repository, so please consider how best to do this (e.g., indented, bullets, etc.) in the filter list.

We also discussed a related item in the meeting yesterday about the new sub headers in the filter section, which introduce new terms (Other Resources, Basic science Repositories). Please remove these terms and return to the previously agreed upon Domains of 'IID' and 'Generalist Repositories' while discussions continue about how best to make domain assignments and assign terms.

gtsueng commented 5 days ago

Hi @hartwickma,

@DylanWelzel and I discussed this earlier and he confirmed that it would be possible to update VEuPathDB on Production without pushing VEuPathDB collections. We will proceed to update VEuPathDB on Production without VEuPathDB collections, as the following requests may require additional changes on the back and front ends:

  1. Move the VEuPathDB collections from where they are listed under 'Other Resources' so that they appear with VEuPathDB under 'IID Resources'
  2. We would like the appearance of these VEuPathDB collections to indicate that they are part of the larger VEuPathDB repository, so please consider how best to do this (e.g., indented, bullets, etc.) in the filter list.

We are marking this issue as 'in progress -- refinement' to reflect the changes needed to