Closed ghost closed 7 years ago
$ git checkout icgc
Branch icgc set up to track remote branch icgc from origin.
Switched to a new branch 'icgc'
RJHB392:bioschemas walsbr$ cd bin
RJHB392:bin walsbr$ ./package-all.sh
proto copied into ../bioschemas/snapshot/proto/bmeg
proto copied into ../bioschemas/snapshot/proto/ga4gh
cerberus code snapshot into ../bioschemas/snapshot/cerberus/bmeg
cerberus code snapshot into ../bioschemas/snapshot/cerberus/ga4gh
cerberus code moved into ../bioschemas/snapshot/cerberus/gdc
jsonschema code generated into ../bioschemas/snapshot/jsonschema/bmeg
jsonschema code generated into ../bioschemas/snapshot/jsonschema/ga4gh
jsonschema code generated into ../bioschemas/snapshot/jsonschema/gdc
running test
running egg_info
writing requirements to Bioschemas.egg-info/requires.txt
writing Bioschemas.egg-info/PKG-INFO
writing top-level names to Bioschemas.egg-info/top_level.txt
writing dependency_links to Bioschemas.egg-info/dependency_links.txt
reading manifest file 'Bioschemas.egg-info/SOURCES.txt'
reading manifest template 'MANIFEST.in'
writing manifest file 'Bioschemas.egg-info/SOURCES.txt'
running build_ext
bioschemas_tests.test_should_return_path ... ok
bioschemas_tests.test_paths_should_have_proto ... ok
bioschemas_tests.test_should_jsonschema ... ok
bioschemas_tests.test_should_cerberus_schema ... ok
bioschemas_tests.test_should_return_git_hashes ... ok
bioschemas_tests.test_should_have_gdc_submission_templates ... ok
bioschemas_tests.test_should_return_submission_template_by_type ... ok
bioschemas_tests.test_file_centric_is_valid ... ERROR
======================================================================
ERROR: bioschemas_tests.test_file_centric_is_valid
----------------------------------------------------------------------
Traceback (most recent call last):
File "/Library/Python/2.7/site-packages/nose/case.py", line 197, in runTest
self.test(*self.arg)
File "/Users/walsbr/bioschemas/bioschemas/tests/bioschemas_tests.py", line 40, in test_file_centric_is_valid
schema = bioschemas.json_schema('file-centric')
File "/Users/walsbr/bioschemas/bioschemas/__init__.py", line 59, in json_schema
return json_schemas['definitions'][key]
KeyError: 'file-centric'
----------------------------------------------------------------------
Ran 8 tests in 0.104s
FAILED (errors=1)
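For context on the failure: the traceback ends in a plain dictionary lookup, so a missing 'file-centric' entry under 'definitions' surfaces as a bare KeyError. Here is a minimal sketch of a lookup that reports which definitions were actually packaged. It assumes json_schemas is a dict with a 'definitions' key, as line 59 of __init__.py suggests; passing the dict in as an argument and the wording of the message are my assumptions, not the package's actual code.

```python
def json_schema(json_schemas, key):
    """Look up one schema definition, listing the available keys on a miss.

    Hypothetical variant of the lookup in the traceback above; the real
    function takes only `key` and reads a module-level dict.
    """
    definitions = json_schemas["definitions"]
    try:
        return definitions[key]
    except KeyError:
        # Name the packaged definitions so a packaging gap is obvious.
        available = ", ".join(sorted(definitions)) or "(none)"
        raise KeyError("no schema named %r; available: %s" % (key, available))
```

With something like this, a run where package-all.sh skipped the icgc schema would show which definitions did get generated.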
It would be a good idea if https://github.com/ohsu-computational-biology/bioschemas/issues/2 was included in this.
Also, the git hashes should be updated w/ the icgc hash, and the test enhanced to ensure all hashes appear.
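A rough sketch of what the enhanced hash test could look like. It assumes the package exposes the hashes as a dict keyed by schema name; that shape, the helper name, and the placeholder hash values are all hypothetical, and only the schema names come from the snapshot directories in the log above (plus icgc).

```python
# Schema names taken from the snapshot directories, plus the new icgc entry.
EXPECTED_SCHEMAS = ("bmeg", "ga4gh", "gdc", "icgc")


def missing_hashes(git_hashes, expected=EXPECTED_SCHEMAS):
    """Return the expected schema names that have no non-empty hash."""
    return [name for name in expected if not git_hashes.get(name)]


def test_should_return_all_git_hashes():
    # Placeholder values, not real commit hashes; in the real test this
    # dict would come from the package's git-hash accessor.
    hashes = {"bmeg": "aaaaaaa", "ga4gh": "bbbbbbb",
              "gdc": "ccccccc", "icgc": "ddddddd"}
    assert missing_hashes(hashes) == []
```

Returning the list of missing names, rather than a bare boolean, makes the failure message say which hash was left out.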
checking in, how is this going?
@bwalsh, the icgc schema isn't in a separate repository, so I don't have a git hash for it. how do you want to handle that? should I use the bioschemas hash?
@bwalsh @mayfielg I added the hash snapshot and fixed the error in packaging. Please look again.
My preference would be that all of the schemas were in the same format. Would it be possible to switch the json-schema over to protobuf? For these sets of messages, the protobuf would be something like:
syntax = "proto3";
package icgc;
message DataBundle {
string data_bundle_id = 1;
}
message AnalysisMethod {
string analysis_type = 1;
string software = 2;
}
message DataCategorization {
string data_type = 1;
string experimental_strategy = 2;
}
message ReferenceGenome {
string genome_build = 1;
string reference_name = 2;
string download_url = 3;
}
message IndexFile {
string id = 1;
string object_id = 2;
string file_name = 3;
string file_format = 4;
int64 file_size = 5;
string file_md5sum = 6;
string repo_file_id = 7;
}
message CopiedFile {
string file_name = 1;
int64 file_size = 2;
string file_md5sum = 3;
int64 last_modified = 4;
IndexFile index_file = 5;
}
message OtherIdentifiers {
string tcga_participant_barcode = 1;
repeated string tcga_sample_barcode = 2;
repeated string tcga_aliquot_barcode = 3;
}
message Donor {
string project_code = 1;
string program = 2;
string study = 3;
string primary_site = 4;
string donor_id = 5;
repeated string specimen_id = 6;
repeated string specimen_type = 7;
repeated string sample_id = 8;
string matched_control_sample_id = 9;
string submitted_donor_id = 10;
string submitted_specimen_id = 11;
string submitted_sample_id = 12;
OtherIdentifiers other_identifiers = 13;
}
message FileCentric {
string id = 1;
string object_id = 2;
repeated string study = 3;
string access = 4;
DataBundle data_bundle = 5;
AnalysisMethod analysis_method = 6;
DataCategorization data_categorization = 7;
ReferenceGenome reference_genome = 8;
repeated CopiedFile files_copied = 9;
string repo_data_bundle_id = 10;
string repo_file_id = 11;
repeated string repo_data_set_ids = 12;
string repo_type = 13;
string repo_org = 14;
string repo_name = 15;
string repo_code = 16;
string repo_country = 17;
string repo_base_url = 18;
string repo_data_path = 19;
string repo_metadata_path = 20;
repeated Donor donors = 21;
}
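To make the shape of these messages concrete, here is a small Python sketch of a document that follows the FileCentric layout, with a very light type check. The field names come from the proposal above; the sample values and the checker are made up for illustration, and this is far weaker than real jsonschema or protobuf validation.

```python
# Expected Python types for a few top-level FileCentric fields.
# `repeated` fields map to lists, nested messages map to dicts.
EXPECTED_TYPES = {
    "id": str,
    "object_id": str,
    "study": list,
    "access": str,
    "data_bundle": dict,
    "files_copied": list,
    "donors": list,
}


def type_errors(document, expected=EXPECTED_TYPES):
    """Return a sorted list of fields that are present but have the wrong type."""
    return sorted(
        name for name, kind in expected.items()
        if name in document and not isinstance(document[name], kind)
    )


# Illustrative values only; none of these identifiers are real.
sample = {
    "id": "FI0001",
    "object_id": "example-object",
    "study": ["example-study"],
    "access": "open",
    "data_bundle": {"data_bundle_id": "example-bundle"},
    "files_copied": [{"file_name": "example.bam", "file_size": 1024}],
    "donors": [{"donor_id": "DO0001", "specimen_id": ["SP0001"]}],
}

print(type_errors(sample))                   # []
print(type_errors(dict(sample, study="x")))  # ['study']
```

Absent fields are not reported, which matches proto3, where every field is optional.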
Tests look good.
$ git status
On branch icgc
Your branch is up-to-date with 'origin/icgc'.
$ cd bin
$ ./package-all.sh
proto copied into ../bioschemas/snapshot/proto/bmeg
proto copied into ../bioschemas/snapshot/proto/ga4gh
cerberus code snapshot into ../bioschemas/snapshot/cerberus/bmeg
cerberus code snapshot into ../bioschemas/snapshot/cerberus/ga4gh
cerberus code moved into ../bioschemas/snapshot/cerberus/gdc
mkdir: ../bioschemas/snapshot/jsonschema: File exists
jsonschema code generated into ../bioschemas/snapshot/jsonschema/bmeg
jsonschema code generated into ../bioschemas/snapshot/jsonschema/ga4gh
jsonschema code generated into ../bioschemas/snapshot/jsonschema/gdc
running test
Searching for nose
Best match: nose 1.3.7
Processing nose-1.3.7-py2.7.egg
Using /Users/walsbr/bioschemas/.eggs/nose-1.3.7-py2.7.egg
running egg_info
writing requirements to Bioschemas.egg-info/requires.txt
writing Bioschemas.egg-info/PKG-INFO
writing top-level names to Bioschemas.egg-info/top_level.txt
writing dependency_links to Bioschemas.egg-info/dependency_links.txt
reading manifest file 'Bioschemas.egg-info/SOURCES.txt'
reading manifest template 'MANIFEST.in'
writing manifest file 'Bioschemas.egg-info/SOURCES.txt'
running build_ext
bioschemas_tests.test_should_return_path ... ok
bioschemas_tests.test_paths_should_have_proto ... ok
bioschemas_tests.test_should_jsonschema ... ok
bioschemas_tests.test_should_cerberus_schema ... ok
bioschemas_tests.test_should_return_git_hashes ... ok
bioschemas_tests.test_should_have_gdc_submission_templates ... ok
bioschemas_tests.test_should_return_submission_template_by_type ... ok
bioschemas_tests.test_file_centric_is_valid ... ok
bioschemas_tests.test_file_centric_validation ... ok
----------------------------------------------------------------------
Ran 9 tests in 0.062s
OK
Kyle, I'm going to try to make a protobuf that compiles to the icgc-dcc JSON.
@kellrott Agree that all kafka messages should be proto. The file-centric jsonschema is an internal schema used by the dcc-portal. The intent of exposing it here is as prep for the add-a-file story in our backlog.
The document represented by this schema would be manipulated by the euler.todo (not yet written) component below as part of its CRUD operations with elastic. It is not intended to be pushed on the wire.
More info on the dataflows
Brian W., so you're saying we shouldn't try for a protobuf version of file-centric JSON?
Yes. Don't see the need short-to-mid-term (don't see us sending internal dcc structures on the wire to other systems). Will know more after we implement add-a-file. Reconsider then.
-Brian Walsh
If the schema is internal and project specific, what's the value of adding it to bioschemas? Does it need to be shared by multiple codebases and this is the most convenient way to achieve that?
I think consistency in formats supported by bioschemas is a worthy goal itself. If we support jsonschema and protobuf, can we commit to supporting those for every schema?
> If the schema is internal and project specific, what's the value of adding it to bioschemas?
That is a decent question. Perhaps added prematurely? Not sure. BrianK?
> If we support jsonschema and protobuf, can we commit to supporting those for every schema?
I thought:
- we had settled on protobuf for our schemas
- the primary focus was to act as a versioned clearing house for the multitude of downstream dependencies, freeing projects from having to manage [copy-and-paste, git-submodule, etc.] the choice of schema downstream dependencies
- I don't recall a discussion or use case for any-to-any schema conversion or rewrites
-Brian Walsh
One purpose of having the file-centric schema is so that we can specify and test software that updates the DCC portal "repository" index. I think having a schema of some sort is helpful, and bioschemas seems the natural place to put it. I don’t see a use case for the protobuf version. One option to go forward is to approve the pull request as is, and convert to protobuf as a separate task when needed.
Options:
- Accept pull request as is, and open task for protobuf conversion as needed
- Reject pull request as unnecessary
- Re-write pull request to use a protobuf version of file-centric schema
Brian K.
> One option to go forward is to approve the pull request as is, and convert to protobuf as a separate task when needed.
+1
-Brian Walsh
Added schema for the file-centric document type used in DCC elasticsearch repository index.