datacarpentry / organization-genomics

Project Organization and Management for Genomics
https://datacarpentry.org/organization-genomics

Metadata example in messy spreadsheet #67

Closed hoytpr closed 6 years ago

hoytpr commented 6 years ago

In 01_tidiness_datasheet_example_messy.png, there are "description" columns. These columns have spaces because the contents must NOT be critical for bcl2fastq. So these are metadata columns, and later in the cleaned spreadsheet they continue to have spaces. (Note: as far as I am aware, our Illumina sequencing instrument submission sheets do not allow metadata.) I propose we rename the columns from "Study_Description" to "Study_Metadata" and "Biosample_Description" to "BioSample_Metadata". The "Sample_Owner" must also be a metadata column, and could be called "Owner_Metadata". Without these changes learners would likely replace all spaces in these columns with underscores as well. This is an opportunity to make it clear that metadata can exist in a spreadsheet that also contains data, and so should be labeled clearly.

ErinBecker commented 6 years ago

Hi @hoytpr - Thanks for this PR! I'm not a Maintainer for this lesson (although I am interested in following the group's progress). Can I suggest that you add @raynamharris to your assignments? She is one of the new Maintainers for this lesson.

hoytpr commented 6 years ago

Hi @ErinBecker Yes, I will. I was using the "assignees" button (lazy). I should also include @Roselynlemusinmegen if my notes are correct. This isn't a PR yet. Hoping for some feedback/discussion, but I clearly use labels wrong, so didn't add one. :)

hoytpr commented 6 years ago

Because no one has commented on this I plan to close this issue on Aug 2, 2018 without making any changes.

hlapp commented 6 years ago

I'd argue that *_Metadata is more (and confusingly so) generic, and not equivalent to *_Description. This (the concept of metadata) is also what the lesson talks about, so it could be quite confusing to talk about metadata as a category of data (i.e., data about data) in one context and then have it as a column side by side with other columns that are also metadata (such as identifiers, for example!).

I think the instructors (and possibly the lesson text itself – maybe that's what you really mean?) need to bring out the distinctions between types of metadata, and what the consequences of those distinctions are. For example, some metadata columns will be used by machines to process data (de-multiplexing, indexing, summarizing, query-joining, etc), and those should therefore not contain anything that can impede that in the future. Other metadata columns, however, are for humans, not for machines, and thus should be kept best readable by humans.
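One way to picture the distinction above is a minimal Python sketch that tidies only the columns a machine will read and leaves the human-facing ones untouched. The choice of machine-facing columns and the sample values are assumptions made up for illustration; they are not taken from the lesson's spreadsheet.

```python
import re

# Assumption: these are the columns a pipeline reads; the sheet's author decides this.
MACHINE_COLUMNS = {"Sample_ID", "Sample_Name", "Sample_Project"}


def tidy_row(row):
    """Return a copy of row with whitespace replaced only in machine-facing columns."""
    cleaned = {}
    for column, value in row.items():
        if column in MACHINE_COLUMNS:
            # Machine-facing: collapse each run of whitespace into one underscore.
            cleaned[column] = re.sub(r"\s+", "_", value.strip())
        else:
            # Human-facing: keep the text readable, spaces included.
            cleaned[column] = value
    return cleaned


# Hypothetical row, invented for this example.
row = {
    "Sample_ID": "liver 01",
    "Sample_Project": "mouse RNA seq",
    "Study_Description": "Liver tissue, 6 week old mice",
}
tidied = tidy_row(row)
```

Here `tidied["Sample_ID"]` becomes `liver_01`, while `Study_Description` keeps its spaces because no program depends on it.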

hoytpr commented 6 years ago

Thanks @hlapp and I agree a distinction needs to be made. So maybe "process_metadata" vs "human_metadata"? That would clearly identify the intent of the column. Expanding on the "process_metadata" could be "owner_process_metadata" if used by machines to process. If not used in data processing, the description expansion would be "owner_human_metadata". This reduces the confusion between descriptions and metadata (it's for humans, or for the machine process). Good feedback!

hlapp commented 6 years ago

> So maybe "process_metadata" vs "human_metadata"?

If you're suggesting those as column names, I've never seen them used in reality for sample submission to any sequencing center. They may be good names, but I think it'd be confusing to use sample data that bear little to no resemblance to real data.

hoytpr commented 6 years ago

@hlapp You make a good point. So we should potentially have instructors bring out the distinctions between types of metadata as you suggest. For the purposes of the lesson example, should we also include text from the bcl2fastq guide? "You can use alphanumeric characters, hyphens [-], and underscores [_] for the Sample_Project, Sample_ID, and Sample_Name.

Sample_ID, Sample_Name, and Sample_Project field entries in the sample sheet cannot contain illegal characters that are not allowed by some file systems. Examples of common characters that are not allowed are the space character and the following: ?()[]/\=+<>:;"',*^| &."

Unfortunately, this is not a universal "rule" for metadata.
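The quoted rule could be checked with a short Python sketch like the one below. It is only an illustration of the rule as quoted, not bcl2fastq's own validation: the helper name `invalid_fields` is hypothetical, "alphanumeric" is assumed to mean ASCII letters and digits, and empty values are treated as invalid.

```python
import re

# Assumption: "alphanumeric" means ASCII letters and digits only.
ALLOWED = re.compile(r"[A-Za-z0-9_-]+")

# The three fields named in the quoted guide text.
CHECKED_FIELDS = ("Sample_Project", "Sample_ID", "Sample_Name")


def invalid_fields(row):
    """Return the checked fields in row whose values break the quoted rule."""
    return [
        field
        for field in CHECKED_FIELDS
        if field in row and not ALLOWED.fullmatch(row[field])
    ]


# Hypothetical row, invented for this example.
row = {
    "Sample_Project": "mouse-RNA",
    "Sample_ID": "liver_01",
    "Sample_Name": "liver 01",
    "Study_Description": "Liver tissue, 6 week old mice",
}
problems = invalid_fields(row)
```

Only `Sample_Name` is reported, because of its space; `Study_Description` is never checked, which matches the point that the rule applies to specific fields rather than to metadata in general.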

hoytpr commented 6 years ago

Will need to clarify metadata vs. descriptions at some point.