Create JSON Table Schema for metadata.tsv

danfowler commented 8 years ago

Good Tables can validate both the structure of a dataset and its adherence to a published schema.

danfowler commented 7 years ago

For this issue, I have created a datapackage.json in our /frictionlessdata/ fork of the ADB-User-Study repo. The JSON Table Schema for the only file in the "package" ('metadata.tsv') is embedded therein:

https://github.com/frictionlessdata/ADB-User-Study/blob/master/datapackage.json

danfowler commented 7 years ago

@samuelpayne @coldfire79 can you take a look at the schema that I have created to make sure it captures the kind of data validation you would like to see?

For example:

- We could specify that the ID field needs to be exactly 9 characters
- I classified 'Missing' as a missingValue for "PlatinumStatus"
- For lots of other fields, I classified any of "[Not Applicable]", "[Not Available]", and "[Pending]" as missingValues.
- We could further specify that "age_at_initial_pathologic_diagnosis" be > 0
- Is "NOT HISPANIC OR LATINO" the only possible value for "ethnicity"?
- For "primary_therapy_outcome_success" I found only four possible values in the set: ["COMPLETE RESPONSE","PARTIAL RESPONSE","PROGRESSIVE DISEASE","STABLE DISEASE"]
  - Similar reasoning for "race"
- "tissue_source_site" is categorical: is there a defined set of possible values here?
- I added a guess for possible "tumor_stage"s
- For "Normal_Control" I specified only "Germline Blood" or "Solid Tissue Normal (DNA)"
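As a rough illustration of how a few of the constraints listed above might look in a Table Schema descriptor, here is a minimal Python sketch. The descriptor is a hypothetical excerpt, not the actual contents of datapackage.json: it assumes schema-level `missingValues` and expresses "> 0" as `minimum: 1` on an integer field. `check_cell` is a hand-written helper for illustration only and is not part of Good Tables.

```python
# Hypothetical excerpt of a Table Schema descriptor (not the real datapackage.json).
schema = {
    "fields": [
        {
            "name": "age_at_initial_pathologic_diagnosis",
            "type": "integer",
            # "> 0" written as minimum 1, assuming whole-number ages.
            "constraints": {"minimum": 1},
        },
        {
            "name": "primary_therapy_outcome_success",
            "type": "string",
            "constraints": {
                "enum": [
                    "COMPLETE RESPONSE",
                    "PARTIAL RESPONSE",
                    "PROGRESSIVE DISEASE",
                    "STABLE DISEASE",
                ]
            },
        },
    ],
    # Assumption: missing-value markers declared once for the whole table.
    "missingValues": ["", "[Not Applicable]", "[Not Available]", "[Pending]"],
}


def check_cell(field, raw, missing_values):
    """Return a list of problems for one raw cell value (illustrative helper)."""
    if raw in missing_values:
        return []  # treated as null, so no constraint applies
    constraints = field.get("constraints", {})
    value = raw
    if field["type"] == "integer":
        try:
            value = int(raw)
        except ValueError:
            return [f"{field['name']}: {raw!r} is not an integer"]
    problems = []
    if "minimum" in constraints and value < constraints["minimum"]:
        problems.append(f"{field['name']}: {value} is below {constraints['minimum']}")
    if "enum" in constraints and value not in constraints["enum"]:
        problems.append(f"{field['name']}: {value!r} is not an allowed value")
    return problems
```

Calling `check_cell(schema["fields"][0], "0", schema["missingValues"])` returns one problem, while a cell containing "[Not Available]" is skipped as missing.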

coldfire79 commented 7 years ago

We could specify that the ID field needs to be exactly 9 characters

Not really. These IDs are randomly regenerated because of privacy concerns. However, I wonder if we could use a regular expression to define a rule for the IDs. They should also remain unique.

I classified 'Missing' as a missingValue for "PlatinumStatus"

Yes, but it looks like it has some dependency on "PlatinumFreeInterval". Is there any way to express such dependencies in the schema?

For lots of other fields, I classified one of "[Not Applicable]", "[Not Available]", and "[Pending]" as missingValues.

Yes, that's true.

We could further specify that "age_at_initial_pathologic_diagnosis" be > 0

Yup, sounds good.

Is "NOT HISPANIC OR LATINO" the only possible value for "ethnicity"?

For this dataset, yes. But it could be different in other datasets.

For "primary_therapy_outcome_success" I found only four possible values in the set: ["COMPLETE RESPONSE","PARTIAL RESPONSE","PROGRESSIVE DISEASE","STABLE DISEASE"]
Similar reasoning for "race"
"tissue_source_site" is categorical: is there a defined set of possible values here?
I added a guess for possible "tumor_stage"s
For "Normal_Control" I specified only "Germline Blood" or "Solid Tissue Normal (DNA)"

For this specific dataset, these look fine. But I think we need more datasets to settle these questions. Let me look for datasets and protocols for clinical metadata in cancer studies.

samuelpayne commented 7 years ago

Regarding the first bullet – each project will have its own way of creating a unique ID. They may (or may not) have actual rules regarding these.

Regarding ethnicity – I am not sure what clinical/demographic categories will be for each project.

From: Joon-Yong Lee [mailto:notifications@github.com] Sent: Wednesday, December 14, 2016 10:54 AM To: frictionlessdata/pilot-pnnl Cc: Payne, Samuel H; Mention Subject: Re: [frictionlessdata/pilot-pnnl] Create JSON Table Schema for metadata.tsv (#4)

danfowler commented 7 years ago

We could specify that the ID field needs to be exactly 9 characters

Not really. These IDs are randomly regenerated because of privacy concerns. However, I wonder if we could use a regular expression to define a rule for the IDs. They should also remain unique.

Regarding the first bullet – each project will have its own way of creating a unique ID. They may (or may not) have actual rules regarding these.

We can do regular expressions. I have added a unique constraint. I have also (redundantly) set it as the "primaryKey" for the table.
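For reference, a minimal sketch of what a pattern plus uniqueness rule on the ID column could look like, written in Python. The field descriptor and the pattern `[A-Za-z0-9]{9}` are placeholders invented for illustration, since each project may define IDs differently; `check_ids` is a hypothetical helper that only approximates how a validator might apply the `pattern` and `unique` constraints.

```python
import re

# Hypothetical ID field; the pattern is a placeholder, not a real project rule.
id_field = {
    "name": "ID",
    "type": "string",
    "constraints": {"required": True, "unique": True, "pattern": "[A-Za-z0-9]{9}"},
}
# Setting primaryKey as well is redundant with "unique", as noted above.
schema = {"fields": [id_field], "primaryKey": "ID"}


def check_ids(ids, field):
    """Return (row_number, failed_constraint) pairs for a column of IDs."""
    pattern = re.compile(field["constraints"]["pattern"])
    seen = set()
    problems = []
    for row_number, value in enumerate(ids, start=1):
        # The whole value must match, so fullmatch is used rather than search.
        if not pattern.fullmatch(value):
            problems.append((row_number, "pattern"))
        if value in seen:
            problems.append((row_number, "unique"))
        seen.add(value)
    return problems
```

With this sketch, a repeated ID is reported on its second occurrence, and an ID of the wrong length is reported as a pattern failure.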

I classified 'Missing' as a missingValue for "PlatinumStatus"

Yes, but it looks like it has some dependency on "PlatinumFreeInterval". Is there any way to express such dependencies in the schema?

There is no way to make one column depend on another column in the schema.
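Since the schema cannot express this, a dependency between two columns would have to be checked outside the schema. The Python sketch below shows one way that might look. The rule itself (both fields present, or both missing) is purely an assumption for illustration, as the actual relationship between "PlatinumStatus" and "PlatinumFreeInterval" has not been defined in this thread; the function names and the set of missing markers are likewise hypothetical.

```python
# Markers treated as "no data"; an assumption for this sketch.
MISSING = {"", "Missing", "[Not Available]"}


def platinum_fields_consistent(row):
    """Hypothetical rule: both fields are present, or both are missing."""
    status_missing = row["PlatinumStatus"] in MISSING
    interval_missing = row["PlatinumFreeInterval"] in MISSING
    return status_missing == interval_missing


def inconsistent_rows(rows):
    """Return 1-based row numbers that break the hypothetical rule."""
    return [
        number
        for number, row in enumerate(rows, start=1)
        if not platinum_fields_consistent(row)
    ]
```

A check like this could run alongside schema validation, with the rows read from metadata.tsv as dictionaries keyed by column name.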

We could further specify that "age_at_initial_pathologic_diagnosis" be > 0

Yup, it sounds good.

Added.

https://github.com/frictionlessdata/ADB-User-Study/commit/fb2b9b08ba169c11b191d052f68102c30f92ecbd

Tweety79rw commented 7 years ago

@danfowler @coldfire79 @samuelpayne Hi Dan, I am planning to write a program that generates a basic schema for metadata.tsv, based on the one you made for adb_user_study. That way, someone who creates a project for adbio doesn't have to provide their own. Are there any specifics about the schema that I should know to make it integrate with Good Tables? And if there is anything you need me to help with, let me know.

danfowler commented 7 years ago

@Tweety79rw @coldfire79 @samuelpayne

Hi Dan, I am planning to write a program that generates a basic schema for metadata.tsv, based on the one you made for adb_user_study. That way, someone who creates a project for adbio doesn't have to provide their own.

I wonder if a tool like this could be one of the ultimate deliverables for this pilot. Something like Datapackagist could theoretically provide a foundation for setting types per column, but it provides no support for defining constraints, missingValues, etc. I've created a new issue: https://github.com/frictionlessdata/pilot-pnnl/issues/15

Let's see what @roll and @amercader think here.

Are there any specifics about the schema that I should know to make it integrate with Good Tables? And if there is anything you need me to help with, let me know.

Good Tables should ultimately support whatever is defined for the schema.
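To make the idea of a schema generator concrete, here is a minimal Python sketch of how a basic schema could be inferred from a TSV file. It is only a starting point under stated assumptions: the function names are hypothetical, the missing-value markers are the ones mentioned earlier in this thread, and type inference is limited to integer, number, and string. Constraints such as enum or pattern would still need to be added by hand.

```python
import csv
import io

# Markers mentioned earlier in the thread, plus the empty string.
MISSING = ["", "[Not Applicable]", "[Not Available]", "[Pending]"]


def infer_type(values):
    """Guess a Table Schema type from the non-missing values of one column."""
    present = [v for v in values if v not in MISSING]
    if not present:
        return "string"
    # Try the narrowest type first; Python's int/float parsing is slightly
    # more permissive than a strict validator may be.
    for type_name, cast in (("integer", int), ("number", float)):
        try:
            for v in present:
                cast(v)
            return type_name
        except ValueError:
            continue
    return "string"


def infer_schema(tsv_text):
    """Build a basic schema dict from TSV text that starts with a header row."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    rows = list(reader)
    fields = [
        {"name": name, "type": infer_type([row[name] for row in rows])}
        for name in reader.fieldnames
    ]
    return {"fields": fields, "missingValues": MISSING}
```

The resulting dictionary can be serialized with `json.dumps` and then edited to add the constraints discussed in this issue.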

danfowler commented 7 years ago

Table Schema for metadata.tsv created: https://github.com/frictionlessdata/ADB-User-Study/blob/master/metadata-schema.json
