Library Minimal Metadata review

patrick-lloyd-ray commented 1 year ago

[x] add Kim as major contributor

lydiang commented 1 year ago

Hi @patrick-lloyd-ray

I see one of the proposed field name is "2nd&3rd round barcode" - would be good to avoid characters like ampersand, slashes etc in the field name ...

lydiang commented 1 year ago

@patrick-lloyd-ray

In the official CSV file, I see that there is sub_domain category (Receive Sample, Generate Library, Pool Library, Delivery Library Pool).

These sub-domains are too broad and makes it difficult to tie the metadata to specific entities. In the original file, the was a "SubGroup" field can that be use instead or in addition to?

This will make it easier to users of this schema to pull out metadata specific to a particular entity.

For example, "antibody information" is not obvious for the official CSV file that this is a property of a barcoded cell sample.

patrick-lloyd-ray commented 1 year ago

@lydiang, yes, that's no problem. Just to check my understanding: you would like the sub-domain categories to be changed to what were called the 'sub-group' fields in the csv. So, I'd be eliminating receive sample, generate library, pool library et al. and adding amplified cDNA, barcoded cell sample et al., right?

If that's the case, @hongna2008 could you please verify this change with your group as well?

lydiang commented 1 year ago

@patrick-lloyd-ray

The "subgroup" needs to be at least need to be in addition to to make it usable

patrick-lloyd-ray commented 1 year ago

@lydiang -- that makes sense to me. As long as there are no objections from the Task Force, I'm okay with adding or changing (just want to verify with them).

hongna2008 commented 12 months ago

@lydiang, yes, that's no problem. Just to check my understanding: you would like the sub-domain categories to be changed to what were called the 'sub-group' fields in the csv. So, I'd be eliminating receive sample, generate library, pool library et al. and adding amplified cDNA, barcoded cell sample et al., right?

If that's the case, @hongna2008 could you please verify this change with your group as well?

@patrick-lloyd-ray @lydiang The "SubGroup" field is only used to "generate library" in the current version. It is not applied to the other 3 subdomains. I think we need a discussion before we make any adjustment to these four basic domain categories which has been approved by multiple groups.

lydiang commented 12 months ago

Hi @hongna2008,

Another approach would be not to adjust your four basc domain at all but an include the "subgroup" in addition the formalization. These "subgroup" is already a vetted part of your metadata.

Having them tag as part of the official metadata will make it useable to portals who are modeling entities at the subgroup level.

lydiang commented 12 months ago

@patrick-lloyd-ray, @hongna2008

Kim and I were looking at your PR today, we believe that you might be using a old version of the library minimal metadata. The finalized version is version 4.2 This is the link to the version https://docs.google.com/spreadsheets/d/1mhMQz20t4dNx-Z17wk6FITiKPrK9b7Yu/edit?usp=sharing&ouid=117290496276111470013&rtpof=true&sd=true

Some of the difference that caught our eye are: In the github CSV file you have fields

1st round barcodes
2nd and 3rd round barcode
Antibody information

These no longer exists in version 4.2

These fields are subsumed by "Enriched cell source barcode name" and "Histone modification marker"

@hongna2008 , can you coordinate with @patrick-lloyd-ray to make sure the final version of metadata is used?

hongna2008 commented 12 months ago

@lydiang Thank you for pointing this out. The latest updates by Kim were not synchronized here as this version was uploaded before that. I uploaded the latest version (named Library_Minimal_Metadata_Latesd_Version_final) in the released issue, @patrick-lloyd-ray sorry for the version confusion.

patrick-lloyd-ray commented 12 months ago

Thanks for the catch! I'll work on getting this updated and convert back to "ready for review" after I've made the changes.

lydiang commented 11 months ago

Pasting Kim's latest email here as reference

From: Kimberly Smith KimberlyS@alleninstitute.org Sent: Monday, November 6, 2023 5:30 PM To: Zheng, Wenjin J Wenjin.J.Zheng@uth.tmc.edu; Hong, Na na.hong@yale.edu; Patrick Ray patrick.ray@alleninstitute.org Cc: Lydia Ng LydiaN@alleninstitute.org; Kimberly Smith KimberlyS@alleninstitute.org Subject: Update to Library Minimal Metadata _231106-KS

Jim, Na, and Patrick –

With the link below I am providing a retroactive backfill of operational fields for the Library Minimal Metadata list.

https://docs.google.com/spreadsheets/d/1toLUMfGNI6oulr5I2f7RzfJTaF3Es1YF/edit#gid=1498873717

(Library_Minimal_Metadata_Latesd_Version_update-231106_KS)

These include the addition of the following:

Barcode Cell Sample Method + Value Set (this is already in the SeqPortal) Library Aliquot QC Result + Value Set (this is being deployed in SeqPortal this week) R1/R2 Index Name Value Set additions (from Joe Nery)

All updates are highlighted in bright blue.

Please let me know if you have questions!

Kim

patrick-lloyd-ray commented 11 months ago

@rightbower and @djarecka and @puja-trivedi could you please verify the data types in this csv are correct based on your expectations?

puja-trivedi commented 11 months ago

could you please verify the data types in this csv are correct based on your expectations?

@patrick-lloyd-ray - I think many of the fields that are currently "integer" should actually be "float" based off the data examples given in the green table. For example amplified cDNA amplified quantity ng, amplified cDNA percent cDNA longer than 400bp, and Library input ng are all set to "integer" but the examples given are "float". I can send you a more detailed list of the fields I think have the incorrect data type if that would be helpful.

patrick-lloyd-ray commented 11 months ago

could you please verify the data types in this csv are correct based on your expectations?

@patrick-lloyd-ray - I think many of the fields that are currently "integer" should actually be "float" based off the data examples given in the green table. For example amplified cDNA amplified quantity ng, amplified cDNA percent cDNA longer than 400bp, and Library input ng are all set to "integer" but the examples given are "float". I can send you a more detailed list of the fields I think have the incorrect data type if that would be helpful.

Thanks, it would be really helpful to know which are throwing errors. You can either send me a list or just make suggested changes directly in the PR, whichever is easiest for you.

puja-trivedi commented 11 months ago

Thanks, it would be really helpful to know which are throwing errors. You can either send me a list or just make suggested changes directly in the PR, whichever is easiest for you.

@patrick-lloyd-ray - I made suggestions on the fields I had issues with but I think @djarecka might have a few more suggestions she might want to add.

patrick-lloyd-ray commented 11 months ago

Thanks, it would be really helpful to know which are throwing errors. You can either send me a list or just make suggested changes directly in the PR, whichever is easiest for you.

@patrick-lloyd-ray - I made suggestions on the fields I had issues with but I think @djarecka might have a few more suggestions she might want to add.

Thank you! I'll wait until @djarecka chimes in before I merge.

djarecka commented 11 months ago

@patrick-lloyd-ray - Looks like @puja-trivedi already marked all types that had integer in the data type field and had float examples in the green table.

I have some additional issues with examples that are not in value sets, but looks like the list of values are not in this PR.

djarecka commented 11 months ago

@patrick-lloyd-ray - a few general comments/requests:

in addition to column with examples and "in_subset", could we also have something like "SubGroup" (basically the name of the class the slot should belong). At the beginning I thought that this is under "sub domain name" column, but it seems like this is a name of the tab from the green table
I'm wondering if "version" and "version_date" should be the name of the columns or just the version of entire document? Or this is meant to be the last change for the specific slot?
I'm also not sure about the "size" and "measurement type" columns, what do you expect to keep there?

patrick-lloyd-ray commented 11 months ago

[x] add units to sheet (@pbishwakarma)
[ ] add size (@rightbower)
[ ] add minimum value (@rightbower)
[ ] add maximum value (@rightbower)
[x] add statistical type (@patrick-lloyd-ray)
[x] add in_subset (@puja-trivedi)
[x] change is_required to nullable (@patrick-lloyd-ray)
[x] add LinkML classes (@puja-trivedi)
[x] add LinkML slots/attributes (@puja-trivedi)

brain-bican / metadata-schemas

Library Minimal Metadata review #31