NIAID-Data-Ecosystem / nde-crawlers

Harvesting infrastructure to collect and standardize dataset and computational tool metadata
Apache License 2.0

[Metadata Improvement]: Figshare filtration strategy #155

Open gtsueng opened 3 months ago

gtsueng commented 3 months ago

Issue Name

Figshare filtration strategy

Issue Description

Figshare has too many potentially irrelevant records. In order to reduce potential pollution of the Discovery Portal by Figshare, we should investigate potential filtration strategies.

Potential strategies:

Issue Discussion

The issue with Figshare is referenced in the comments for the following related issue: https://github.com/NIAID-Data-Ecosystem/niaid-feedback/issues/140

Related WBS task

https://github.com/NIAID-Data-Ecosystem/nde-roadmap/issues/9

hartwickma commented 2 months ago

Hi @gtsueng, it is not clear what input is requested from NIAID. Please provide a brief overview, including Scripps' recommendations for how to proceed, and bring it to an upcoming biweekly meeting. Note that in the issue referenced above, the NIAID ask referred to a universal solution for the Portal and not just Figshare.

NIAID would like to discuss strategies for filtering non-IID records as an agenda item in an upcoming meeting. Please suggest as an agenda item when Scripps is ready to present potential solutions.

gtsueng commented 2 months ago

Hi @hartwickma @rshabman @lisa-mml @sudvenk

The three approaches above can also apply to resources outside of Figshare. Because we have augmented each repository with topicCategory information, we could potentially filter by topic as well. The input needed from NIAID would be the funder-based, healthCondition-based, pathogen-based, or topic-based criteria for deciding whether a record should be considered IID. It was mentioned at the in-person meeting that there is a list of inclusion criteria consisting of terms, funders, pathogens, diseases, etc. We propose using this list to determine whether a dataset is considered IID.

@rshabman previously provided a list of potential pathogens that can be used for the inclusion criteria, but this would cover only infectious diseases, not allergies and immunological disorders.

hartwickma commented 2 months ago

Hi @gtsueng, thank you for providing this overview. If we follow correctly, your recommendation is the 3rd option, potentially also including topic categories? NIAID can discuss internally the idea of list-based inclusion criteria. Please provide a bit more information about what is meant by the 'con' for option 3: "As this list is internal to NIAID, the mapping may need to be done internally as well."

We look forward to discussing this further at an upcoming biweekly meeting.

gtsueng commented 2 months ago

Hi @hartwickma,

The Figshare categories are domain/discipline-based categories similar to the topicCategories. The difference is that topicCategories are based on the EDAM Topics ontology, which is restricted to the biological/biomedical sciences. (Figshare has Geosciences and other disciplines that are not traditionally considered biological/biomedical and are not in EDAM.)

Approach #1 is funder-based, which we don't recommend (but we know NLM employs this to some extent for PubMed).

Approach #2 is topic-based, which may work better than funder-based (since we can augment topicCategories to a greater extent than we can augment funder information).

Approach #3 combines the two, which can potentially work better than just one or the other.
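To make the difference between the three approaches concrete, here is a rough sketch, not how the crawlers actually implement anything. The flat `funders` and `topics` fields, the function names, and the sample values are all hypothetical, and reading "combines the two" as "either criterion is enough" is an assumption; requiring both would be an equally valid reading.

```python
def passes_funder(record, allowed_funders):
    """Approach #1: keep a record if any of its funders is on the allowed list."""
    return any(f in allowed_funders for f in record.get("funders", []))


def passes_topic(record, allowed_topics):
    """Approach #2: keep a record if any of its topics is on the allowed list."""
    return any(t in allowed_topics for t in record.get("topics", []))


def passes_combined(record, allowed_funders, allowed_topics):
    """Approach #3: assumed here to mean that either criterion suffices."""
    return passes_funder(record, allowed_funders) or passes_topic(record, allowed_topics)


# Made-up example values, for illustration only.
allowed_funders = {"Example Funder A"}
allowed_topics = {"Immunology"}

record_without_funder = {"funders": [], "topics": ["Immunology"]}
record_out_of_scope = {"funders": ["Other Funder"], "topics": ["Geosciences"]}
```

With these sample records, the first one fails the funder-only check but passes the topic-based and combined checks, which is the situation where missing funder metadata would otherwise drop a relevant record.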

The con we described is that the criteria list (mentioned by Reed at the internal meeting) was considered an internal list to NIAID. Reed provided us with the list of pathogens that were on that internal list, but it's unclear if there were additional domain/discipline/topicCategory-based criteria, health condition/disease criteria, etc. that could be used. Since we don't have access to this internal list, we could provide NIAID with the list of the topicCategories and let you select which ones to include. The same goes for health condition/disease.

Alternatively, if the NIAID team could provide a list of domain/discipline-based criteria or health condition/disease criteria, we could do the mappings to EDAM Topics, MONDO, DO, etc. and include all children.
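As an illustration of what "include all children" could look like once terms are mapped, here is a minimal sketch. It assumes the ontology hierarchy is already available as a parent-to-children dictionary; the function name and the `EX:` identifiers are invented for the example and are not real EDAM, MONDO, or DO terms.

```python
from collections import deque


def expand_with_descendants(selected, children):
    """Return the selected terms plus every descendant reachable via `children`."""
    seen = set(selected)
    queue = deque(selected)
    while queue:
        term = queue.popleft()
        for child in children.get(term, ()):
            if child not in seen:  # guards against revisiting shared descendants
                seen.add(child)
                queue.append(child)
    return seen


# Toy hierarchy with made-up identifiers.
toy_children = {
    "EX:infectious_disease": ["EX:viral_infection", "EX:bacterial_infection"],
    "EX:viral_infection": ["EX:influenza"],
    "EX:geoscience": ["EX:seismology"],
}

included = expand_with_descendants(["EX:infectious_disease"], toy_children)
```

Selecting a single high-level term would then pull in its whole subtree, so the criteria list provided by NIAID could stay short.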

I have added this issue as a potential agenda item for discussion on August 20th.

gtsueng commented 2 months ago

This issue was discussed at the 2024.08.20 biweekly meeting. Rather than starting with inclusion criteria, NIAID has proposed starting with general exclusion criteria to filter out any records that are not life science/biomedically relevant.

Towards this end, Approach #2 would be suitable, considering that the EDAMT ontology is limited to the biomedical/life sciences arena and does not have categories for things like Geosciences, Astronomy, etc. The most straightforward approach to start with would be to filter out any records that don't have topicCategory values assigned (because the topic was not available in the list of EDAMT). A weakness of this approach is that ChatGPT has weaker performance for records with fewer characters, so we would need to determine how best to handle records that don't meet the length requirement needed for an accurate classification by ChatGPT.
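A rough sketch of that exclusion rule, including a separate bucket for records that may be too short to classify reliably. This is only one possible way to handle the short records, not a decided approach. The `name` and `description` fields, the `triage` function, and the 200-character threshold are assumptions for illustration; no actual length requirement has been stated here.

```python
MIN_CHARS = 200  # hypothetical threshold; the real length requirement is not specified


def triage(record, min_chars=MIN_CHARS):
    """Return 'keep', 'exclude', or 'review' for a single metadata record."""
    if record.get("topicCategory"):
        return "keep"  # at least one topicCategory value was assigned
    text = " ".join(filter(None, [record.get("name"), record.get("description")]))
    if len(text) < min_chars:
        return "review"  # too little text to trust the missing classification
    return "exclude"  # enough text, yet no topic in the ontology matched
```

Routing short records to a "review" bucket instead of excluding them outright avoids dropping records only because there was too little text to classify.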

The following actions may be helpful for determining a path forward:

That said, the NIAID team will discuss the matter further and inform the Scripps team on how they would like to move forward.

hartwickma commented 2 months ago

Hi @gtsueng, thank you for providing this overview of the discussion from yesterday's meeting. NIAID would like to move forward with the strategy as outlined above for approach #2. We are also interested in including an option for the user to apply or remove the filter according to their interest/research needs.

gtsueng commented 2 months ago

Thanks for the update @hartwickma. Regarding the user option to apply or remove the filter, can you clarify whether you mean for this to apply to the crude exclusion criteria (of being in the life/biomedical sciences space) or to a narrower, inclusion-based set of criteria?

gtsueng commented 2 months ago

Related discussions: https://github.com/NIAID-Data-Ecosystem/niaid-feedback/issues/113