NCBI-Hackathons / Metadata_categorization

A crowdsourcing/expert curation platform for metadata categorization.
Creative Commons Zero v1.0 Universal

annotDisease field missing in Solr documents #5

Closed eweitz closed 8 years ago

eweitz commented 8 years ago

Each BioSample record in our application is represented as a Solr document. Each underlying field in those documents typically has a "source" value and a corresponding "annotated" value, e.g. sourceCellLine and annotCellLine.

However, the sourceDisease field is missing a corresponding annotDisease field.

http://localhost:8983/solr/annotation/select?q=id%3A3274314%0A&wt=json&indent=true

{
  "responseHeader":{
    "status":0,
    "QTime":0,
    "params":{
      "indent":"true",
      "q":"id:3274314\n",
      "wt":"json"}},
  "response":{"numFound":1,"start":0,"docs":[
      {
        "queueId":101,
        "id":"3274314",
        "sourceCellLine":"HeLa",
        "sampleName":"0",
        "sampleTitle":"0",
        "sourceCellType":"0",
        "sourceSpecies":"Homo Sapiens",
        "sourceAnatomy":"0",
        "sourceDisease":"  ",
        "sourceCellTreatment":"0",
        "annotCellLine":"0",
        "annotCellType":"0",
        "annotSpecies":"0",
        "annotAnatomy":"0",
        "annotCellTreatment":"0",
        "_version_":1524536369343365120}]
  }}

The doc above should contain a field annotDisease.
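A quick way to spot this kind of gap is to derive the expected annot* names from the source* names in a document and report any that are absent. This is only a minimal sketch: the helper name `missing_annot_fields` is hypothetical, and it assumes every source* field is supposed to have an annot* counterpart, which is the convention described above but may not hold for every field.

```python
def missing_annot_fields(doc):
    """Return annot* field names implied by the doc's source* fields but absent from it."""
    expected = {
        "annot" + name[len("source"):]
        for name in doc
        if name.startswith("source")
    }
    return sorted(expected - set(doc))


# Abridged copy of the Solr document shown above.
doc = {
    "id": "3274314",
    "sourceCellLine": "HeLa",
    "sourceDisease": "  ",
    "annotCellLine": "0",
}
print(missing_annot_fields(doc))  # ['annotDisease']
```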

(Also, sourceDisease should follow our convention of indicating empty fields via "0", instead of " ". Finding a better way to represent empty fields would be nice, but is a separate issue.)
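For illustration, the "0" convention could be enforced before documents are indexed with something like the following. The function name `normalize_empty` is hypothetical and this is not the project's actual population script, just a sketch of the idea.

```python
EMPTY = "0"  # placeholder for an empty field, per the convention in this issue


def normalize_empty(doc):
    """Return a copy of doc with blank or whitespace-only string values replaced by "0"."""
    return {
        key: EMPTY if isinstance(value, str) and not value.strip() else value
        for key, value in doc.items()
    }


print(normalize_empty({"sourceDisease": "  ", "sourceCellLine": "HeLa", "queueId": 101}))
```

Non-string values such as queueId are left untouched.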

lepons commented 8 years ago

latest fixes to /annotation core:

  1. add annotDisease field
  2. fix sourceDisease field so that it defaults to "0" instead of " "
  3. add taxId field, populated as 9606 for human samples

eweitz commented 8 years ago

It seems the "annotation" core has regressed. annotDisease is missing in Solr documents from that core:

From http://localhost:8983/solr/annotation/select?q=*%3A*&wt=json&indent=true

{
        "queueId":4,
        "taxId":9606,
        "note":"0",
        "id":"2712518",
        "sourceCellLine":"Sample from Homo sapiens",
        "sampleName":"0",
        "sampleTitle":"0",
        "sourceCellType":"0",
        "sourceSpecies":"Homo Sapiens",
        "sourceAnatomy":"0",
        "sourceDisease":"0",
        "sourceCellTreatment":"0",
        "sourceSex":"0",
        "annotSex":"0",
        "annotCellLine":"0",
        "annotCellType":"0",
        "annotSpecies":"0",
        "annotAnatomy":"0",
        "annotCellTreatment":"0",
        "sourceDevStage":"0",
        "annotDevStage":"0",
        "_version_":1526267270228082688},

Same for AnnotationsDev; see http://localhost:8983/solr/AnnotationsDev/select?q=*%3A*&wt=json&indent=true.
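To check either core for this regression without scanning documents by eye, one option is to ask Solr only for documents that have no value in the field. The sketch below builds such a select URL using standard Solr query syntax (`*:* -field:[* TO *]`). The helper name `missing_field_url` is hypothetical; the host, port, and core names are the ones used in this thread.

```python
from urllib.parse import urlencode

SOLR_BASE = "http://localhost:8983/solr"  # host and port as used in this thread


def missing_field_url(core, field, rows=10):
    """Build a select URL matching documents that have no value for `field`."""
    params = {
        # Match everything, then exclude documents with any value in the field.
        "q": f"*:* -{field}:[* TO *]",
        "wt": "json",
        "indent": "true",
        "rows": rows,
    }
    return f"{SOLR_BASE}/{core}/select?{urlencode(params)}"


print(missing_field_url("annotation", "annotDisease"))
print(missing_field_url("AnnotationsDev", "annotDisease"))
```

If the field is populated everywhere, numFound in the response should be 0.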

lepons commented 8 years ago

The human samples in AnnotationsDev are not in great shape. There are lots of layers of test data that I never refreshed after running tests into it, so probably only some of the documents have the field populated correctly. I'm going to take down annotation and repopulate it. It was bad versioning on my part: when I populated it today, I introduced this other problem. annotation should be back up in an hour.

FWIW, if you want to see a test case of the clustering approach, queue 81 in AnnotationsDev has all the HEK293 samples. They're not disambiguated, just the free-text matches, but that seemed like a better place to start than nothing.

lepons commented 8 years ago

annotation is repopulated. I'm going to apply the sorting approach to HEK293 in annotation as well, and then it should match AnnotationsDev: in queue 81, you should see the clustered HEK293 samples, all appearing in that queue.

lepons commented 8 years ago

The HEK293 cluster is in queue 4. You can view it at http://localhost:8983/solr/annotation/select?q=queueId%3A4+AND+HEK293&wt=json&indent=true

"response": {
  "numFound": 652,
  "start": 0,
  "docs": [
    {
      "queueId": 4,
      "id": "2147292",
      "taxId": 9606,
      "sourceCellLine": "HEK293 GMUCT B",
      "sampleName": "0",
      "sampleTitle": "0",
      "sourceCellType": "HEK293 cells",
      "sourceSpecies": "Homo Sapiens",
      "sourceAnatomy": "0",
      "sourceDisease": "0",
      "annotCellLine": "0",
      "annotCellType": "0",
      "annotSpecies": "0",
      "annotAnatomy": "0",
      "annotDisease": "0",
      "annotCellTreatment": "0",
      "note": "0",
      "_version_": 1526295291084406800
    },
    {
      "queueId": 4,
      "id": "2147291",
      "taxId": 9606,
      "sourceCellLine": "HEK293 GMUCT A",
      "sampleName": "0",
      "sampleTitle": "0",
      "sourceCellType": "HEK293 cells",
      "sourceSpecies": "Homo Sapiens",
      "sourceAnatomy": "0",
      "sourceDisease": "0",
      "annotCellLine": "0",
      "annotCellType": "0",
      "annotSpecies": "0",
      "annotAnatomy": "0",
      "annotDisease": "0",
      "annotCellTreatment": "0",
      "note": "0",
      "_version_": 1526295291092795400
    },
    {
      "queueId": 4,
      "id": "3301901",
      "taxId": 9606,
      "sourceCellLine": "HEK293",
      "sampleName": "0",
      "sampleTitle": "0",
      "sourceCellType": "0",
      "sourceSpecies": "Homo Sapiens",
      "sourceAnatomy": "missing",
      "sourceDisease": "0",
      "annotCellLine": "0",
      "annotCellType": "0",
      "annotSpecies": "0",
      "annotAnatomy": "0",
      "annotDisease": "0",
      "annotCellTreatment": "0",
      "note": "0",
      "_version_": 1526295291101184000
    },
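To get a feel for how well a queue hangs together, it can help to tally the distinct sourceCellLine values among its documents. The snippet below is a rough sketch of that check only, not the clustering approach itself; the function name `cell_line_counts` is hypothetical, and the sample documents are abridged from the response above.

```python
from collections import Counter


def cell_line_counts(docs):
    """Count how often each sourceCellLine value occurs in a list of Solr docs."""
    # Documents lacking the field are counted under "0", the empty-field placeholder.
    return Counter(doc.get("sourceCellLine", "0") for doc in docs)


# Abridged documents from queue 4.
docs = [
    {"queueId": 4, "id": "2147292", "sourceCellLine": "HEK293 GMUCT B"},
    {"queueId": 4, "id": "2147291", "sourceCellLine": "HEK293 GMUCT A"},
    {"queueId": 4, "id": "3301901", "sourceCellLine": "HEK293"},
]
for value, count in cell_line_counts(docs).most_common():
    print(count, value)
```

In practice, docs would be the response["docs"] list from the parsed JSON returned by the query URL above.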

eweitz commented 8 years ago

Awesome, annotation is back up with the fixed annotDisease field.

And at a glance, the sourceCellLine clustering looks much better than before in annotation. The clustering seems to be similar in AnnotationsDev.

lepons commented 8 years ago

It's just one cluster, but I was wondering if I populated AnnotationsDev the last time from the wrong (read: unsorted) file.

eweitz commented 8 years ago

This was fixed again a few days ago.