biostream / bioschemas

ga4gh, gdc and bmeg in one place
MIT License

Icgc #5

Closed ghost closed 7 years ago

ghost commented 7 years ago

Added schema for the file-centric document type used in DCC elasticsearch repository index.

bwalsh commented 7 years ago

$ git checkout icgc
Branch icgc set up to track remote branch icgc from origin.
Switched to a new branch 'icgc'
RJHB392:bioschemas walsbr$ cd bin
RJHB392:bin walsbr$ ./package-all.sh
proto copied into ../bioschemas/snapshot/proto/bmeg
proto copied into ../bioschemas/snapshot/proto/ga4gh
cerberus code snapshot into ../bioschemas/snapshot/cerberus/bmeg
cerberus code snapshot into ../bioschemas/snapshot/cerberus/ga4gh
cerberus code moved into ../bioschemas/snapshot/cerberus/gdc
jsonschema code generated into ../bioschemas/snapshot/jsonschema/bmeg
jsonschema code generated into ../bioschemas/snapshot/jsonschema/ga4gh
jsonschema code generated into ../bioschemas/snapshot/jsonschema/gdc
running test
running egg_info
writing requirements to Bioschemas.egg-info/requires.txt
writing Bioschemas.egg-info/PKG-INFO
writing top-level names to Bioschemas.egg-info/top_level.txt
writing dependency_links to Bioschemas.egg-info/dependency_links.txt
reading manifest file 'Bioschemas.egg-info/SOURCES.txt'
reading manifest template 'MANIFEST.in'
writing manifest file 'Bioschemas.egg-info/SOURCES.txt'
running build_ext
bioschemas_tests.test_should_return_path ... ok
bioschemas_tests.test_paths_should_have_proto ... ok
bioschemas_tests.test_should_jsonschema ... ok
bioschemas_tests.test_should_cerberus_schema ... ok
bioschemas_tests.test_should_return_git_hashes ... ok
bioschemas_tests.test_should_have_gdc_submission_templates ... ok
bioschemas_tests.test_should_return_submission_template_by_type ... ok
bioschemas_tests.test_file_centric_is_valid ... ERROR

======================================================================
ERROR: bioschemas_tests.test_file_centric_is_valid
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/Library/Python/2.7/site-packages/nose/case.py", line 197, in runTest
    self.test(*self.arg)
  File "/Users/walsbr/bioschemas/bioschemas/tests/bioschemas_tests.py", line 40, in test_file_centric_is_valid
    schema = bioschemas.json_schema('file-centric')
  File "/Users/walsbr/bioschemas/bioschemas/__init__.py", line 59, in json_schema
    return json_schemas['definitions'][key]
KeyError: 'file-centric'

----------------------------------------------------------------------
Ran 8 tests in 0.104s

FAILED (errors=1)
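
The traceback above ends in a plain dictionary lookup, so a definition that was never packaged surfaces as a bare KeyError. A minimal sketch of a lookup that also reports which definitions are present, assuming only the {"definitions": {...}} shape visible in the traceback; the sample entry and the error wording are hypothetical, not the package's actual code:

```python
# Hypothetical stand-in for the packaged snapshot. Only the outer shape
# {"definitions": {...}} is taken from the traceback; the entry is a placeholder.
json_schemas = {
    "definitions": {
        "example-type": {"type": "object"},
    }
}


def json_schema(key):
    """Return the JSON Schema definition registered under `key`."""
    definitions = json_schemas["definitions"]
    try:
        return definitions[key]
    except KeyError:
        # List what is available so a packaging gap is obvious at a glance.
        available = ", ".join(sorted(definitions)) or "(none)"
        raise KeyError(
            "no schema named %r; available definitions: %s" % (key, available)
        )
```

With a message like this, a failing test points at the packaging step rather than at the test itself.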

bwalsh commented 7 years ago

It would be a good idea if https://github.com/ohsu-computational-biology/bioschemas/issues/2 were included in this.

bwalsh commented 7 years ago

Also, the git hashes should be updated w/ the icgc hash, and the test enhanced to ensure all hashes appear.
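
A sketch of what the enhanced hash test might check, assuming the package can return a mapping of source name to commit hash. The helper name, the expected source list, and the full 40-character SHA-1 format are all assumptions, and the placeholder values are not real hashes:

```python
import re

# Sources named in this thread; treating them as the required set is an assumption.
EXPECTED_SOURCES = {"bmeg", "ga4gh", "gdc", "icgc"}

# Assumes full-length SHA-1 commit hashes rather than abbreviated ones.
SHA1_PATTERN = re.compile(r"^[0-9a-f]{40}$")


def missing_or_malformed(hashes, expected=EXPECTED_SOURCES):
    """Return a sorted list of sources whose hash is absent or not a SHA-1."""
    problems = []
    for source in sorted(expected):
        value = hashes.get(source)
        if value is None or not SHA1_PATTERN.match(value):
            problems.append(source)
    return problems


def test_should_return_all_git_hashes():
    # Placeholder values only; a real test would ask the package for its hashes.
    hashes = {name: "0" * 40 for name in EXPECTED_SOURCES}
    assert missing_or_malformed(hashes) == []
```

Returning the list of offending sources, rather than a boolean, keeps the failure message specific about which hash is missing.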

bwalsh commented 7 years ago

checking in, how is this going?

ghost commented 7 years ago

@bwalsh, the icgc schema isn't in a separate repository, so I don't have a git hash for it. how do you want to handle that? should I use the bioschemas hash?

ghost commented 7 years ago

@bwalsh @mayfielg I added the hash snapshot and fixed the error in packaging. Please look again.

kellrott commented 7 years ago

My preference would be that all of the schemas were in the same format. Would it be possible to switch the json-schema over to protobuf? For these sets of messages, the protobuf would be something like:


syntax = "proto3";

package icgc;

message DataBundle {
  string data_bundle_id = 1;
}

message AnalysisMethod {
  string analysis_type = 1;
  string software = 2;
}

message DataCategorization {
  string data_type = 1;
  string experimental_strategy = 2;
}

message ReferenceGenome {
  string genome_build = 1;
  string reference_name = 2;
  string download_url = 3; 
}

message IndexFile {
  string id = 1;
  string object_id = 2;
  string file_name = 3;
  string file_format = 4;
  int64  file_size = 5;
  string file_md5sum = 6;
  string repo_file_id = 7;
}

message CopiedFile {
  string file_name = 1;
  int64  file_size = 2;
  string file_md5sum = 3;
  int64  last_modified = 4;
  IndexFile index_file = 5;
}

message OtherIdentifiers {
  string tcga_participant_barcode = 1;
  repeated string tcga_sample_barcode = 2;
  repeated string tcga_aliquot_barcode = 3;
}

message Donor {
  string project_code = 1;
  string program = 2;
  string study = 3;
  string primary_site = 4;
  string donor_id = 5;
  repeated string specimen_id = 6;
  repeated string specimen_type = 7;
  repeated string sample_id = 8;
  string matched_control_sample_id = 9;
  string submitted_donor_id = 10;
  string submitted_specimen_id = 11;
  string submitted_sample_id = 12;
  OtherIdentifiers other_identifiers = 13;
}

message FileCentric {
  string id = 1;
  string object_id = 2;
  repeated string study = 3;
  string access = 4;
  DataBundle data_bundle = 5;
  AnalysisMethod analysis_method = 6;
  DataCategorization data_categorization = 7;
  ReferenceGenome reference_genome = 8;
  repeated CopiedFile files_copied = 9;
  string repo_data_bundle_id = 10;
  string repo_file_id = 11;
  repeated string repo_data_set_ids = 12;
  string repo_type = 13;
  string repo_org = 14;
  string repo_name = 15;
  string repo_code = 16;
  string repo_country = 17;
  string repo_base_url = 18;
  string repo_data_path = 19;
  string repo_metadata_path = 20;
  repeated Donor donors = 21;
}

bwalsh commented 7 years ago

Tests look good.

$ git status
On branch icgc
Your branch is up-to-date with 'origin/icgc'.

$ cd bin
$ ./package-all.sh
proto copied into ../bioschemas/snapshot/proto/bmeg
proto copied into ../bioschemas/snapshot/proto/ga4gh
cerberus code snapshot into ../bioschemas/snapshot/cerberus/bmeg
cerberus code snapshot into ../bioschemas/snapshot/cerberus/ga4gh
cerberus code moved into ../bioschemas/snapshot/cerberus/gdc
mkdir: ../bioschemas/snapshot/jsonschema: File exists
jsonschema code generated into ../bioschemas/snapshot/jsonschema/bmeg
jsonschema code generated into ../bioschemas/snapshot/jsonschema/ga4gh
jsonschema code generated into ../bioschemas/snapshot/jsonschema/gdc
running test
Searching for nose
Best match: nose 1.3.7
Processing nose-1.3.7-py2.7.egg

Using /Users/walsbr/bioschemas/.eggs/nose-1.3.7-py2.7.egg
running egg_info
writing requirements to Bioschemas.egg-info/requires.txt
writing Bioschemas.egg-info/PKG-INFO
writing top-level names to Bioschemas.egg-info/top_level.txt
writing dependency_links to Bioschemas.egg-info/dependency_links.txt
reading manifest file 'Bioschemas.egg-info/SOURCES.txt'
reading manifest template 'MANIFEST.in'
writing manifest file 'Bioschemas.egg-info/SOURCES.txt'
running build_ext
bioschemas_tests.test_should_return_path ... ok
bioschemas_tests.test_paths_should_have_proto ... ok
bioschemas_tests.test_should_jsonschema ... ok
bioschemas_tests.test_should_cerberus_schema ... ok
bioschemas_tests.test_should_return_git_hashes ... ok
bioschemas_tests.test_should_have_gdc_submission_templates ... ok
bioschemas_tests.test_should_return_submission_template_by_type ... ok
bioschemas_tests.test_file_centric_is_valid ... ok
bioschemas_tests.test_file_centric_validation ... ok

----------------------------------------------------------------------
Ran 9 tests in 0.062s

OK

ghost commented 7 years ago

Kyle, I'm going to try to make a protobuf that compiles to the icgc-dcc JSON.
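
One detail worth checking for that: proto3's canonical JSON mapping emits lowerCamelCase keys by default, so a field such as data_bundle_id would serialize as dataBundleId unless the serializer is told to keep the original names (in Python, google.protobuf.json_format.MessageToJson accepts preserving_proto_field_name=True). A small stdlib-only sketch of the default renaming, assuming the icgc-dcc JSON uses the snake_case keys shown in the protobuf sketch above; both helper names are hypothetical and this only approximates the protobuf rule:

```python
def to_lower_camel(name):
    """Convert a snake_case field name to lowerCamelCase."""
    head, *rest = name.split("_")
    return head + "".join(part[:1].upper() + part[1:] for part in rest)


def rename_keys(value):
    """Recursively rename dict keys to lowerCamelCase, leaving values untouched."""
    if isinstance(value, dict):
        return {to_lower_camel(k): rename_keys(v) for k, v in value.items()}
    if isinstance(value, list):
        return [rename_keys(item) for item in value]
    return value
```

Comparing rename_keys(document) against an existing document is a quick way to see which keys would differ under the default mapping.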

bwalsh commented 7 years ago

@kellrott Agree that all kafka messages should be proto. The file-centric jsonschema is an internal schema used by the dcc-portal. The intent of exposing it here is as prep for the add-a-file story in our backlog.

The document represented by this schema would be manipulated by the euler.todo (not yet written) component below as part of its CRUD operations with elastic. It is not intended to be pushed on the wire.

[image] More info on the dataflows

ghost commented 7 years ago

Brian W., so you're saying we shouldn't try for a protobuf version of file-centric JSON?

bwalsh commented 7 years ago

Yes. Don't see the need short-to-mid-term (don't see us sending internal dcc structures on the wire to other systems). Will know more after we implement add-a-file. Reconsider then.

-Brian Walsh

On Wed, Jan 4, 2017 at 9:56 AM, Brian King notifications@github.com wrote:

Brian W., so you're saying we shouldn't try for a protobuf version of file-centric JSON?

buchanae commented 7 years ago

If the schema is internal and project specific, what's the value of adding it to bioschemas? Does it need to be shared by multiple codebases and this is the most convenient way to achieve that?

buchanae commented 7 years ago

I think consistency in formats supported by bioschemas is a worthy goal itself. If we support jsonschema and protobuf, can we commit to supporting those for every schema?

bwalsh commented 7 years ago

"If the schema is internal and project specific, what's the value of adding it to bioschemas?"

That is a decent question. Perhaps added prematurely? Not sure. BrianK?

"If we support jsonschema and protobuf, can we commit to supporting those for every schema?"

I thought:

  • we had settled on protobuf for our schemas
  • the primary focus was to act as a versioned clearing house for the multitude of downstream dependencies, freeing projects from having to manage [copy-and-paste, git-submodule, etc.] their choice of downstream schema dependencies
  • don't recall a discussion or use-case for any-to-any schema conversion or re-writes

-Brian Walsh

On Wed, Jan 4, 2017 at 10:14 AM, Alex Buchanan notifications@github.com wrote:

I think consistency in formats supported by bioschemas is a worthy goal itself. If we support jsonschema and protobuf, can we commit to supporting those for every schema?

ghost commented 7 years ago

One purpose for having the file-centric schema is so that we can specify and test software that updates the DCC portal "repository" index. I think having a schema of some sort is helpful, and bioschemas seems the natural place to put it. I don't see a use case for the protobuf version. One option to go forward is to approve the pull request as is, and convert to protobuf as a separate task when needed.

Options:

  1. Accept pull request as is, and open task for protobuf conversion as needed
  2. Reject pull request as unnecessary
  3. Re-write pull request to use a protobuf version of file-centric schema

Brian K.
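
To illustrate the "specify and test" use case, here is a minimal stdlib-only sketch of a check that index-updating code could run before writing a document. The field names are borrowed from the protobuf sketch earlier in the thread; which fields are required, and the function name, are assumptions rather than the contents of the actual file-centric schema:

```python
# Hand-written subset of checks. Which fields are required is an assumption;
# the real file-centric JSON Schema may differ.
REQUIRED_FIELDS = {
    "id": str,
    "object_id": str,
    "access": str,
    "study": list,
    "donors": list,
}


def validate_file_centric(doc):
    """Return a list of human-readable problems; an empty list means the doc passed."""
    problems = []
    for field, expected_type in sorted(REQUIRED_FIELDS.items()):
        if field not in doc:
            problems.append("missing field: %s" % field)
        elif not isinstance(doc[field], expected_type):
            problems.append(
                "%s should be %s, got %s"
                % (field, expected_type.__name__, type(doc[field]).__name__)
            )
    return problems
```

A full JSON Schema validator would cover nested objects and formats as well; the point here is only that a shared schema gives tests something concrete to assert against.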

bwalsh commented 7 years ago

"One option to go forward is to approve the pull request as is, and convert to protobuf as a separate task when needed."

+1

-Brian Walsh
