elixir-europe / plant-brapi-to-isa

BSD 3-Clause "New" or "Revised" License

NCBI Taxon ID optimisation #54

Closed bedroesb closed 5 years ago

bedroesb commented 5 years ago

New format:

| Source Name | Characteristics[Organism] | Term Source REF | Term Accession Number | Characteristics[Genus] | Characteristics[Species] | Characteristics[Material Source ID] | Protocol REF | Sample Name | Characteristics[Observation Unit Type] | Characteristics[Spatial Distribution] |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Cork oak Barradas daSerra 03 | Quercus suber | NCBITaxon | http://purl.bioontology.org/ontology/NCBITAXON/58331 | Quercus | suber | INIAV:BS03 | Growth | BS3 | plantnumber | [block]1; [plot]1; [plant]BS3; [replicate]1 |
| Corkoak Barradas da Serra 04 | Quercus suber | NCBITaxon | http://purl.bioontology.org/ontology/NCBITAXON/58331 | Quercus | suber | INIAV:BS04 | Growth | BS4 | plantnumber | [block]1; [plot]1; [plant]BS4; [replicate]2 |
| Corkoak Barradas da Serra 05 | Quercus suber | NCBITaxon | http://purl.bioontology.org/ontology/NCBITAXON/58331 | Quercus | suber | INIAV:BS05 | Growth | BS5 | plantnumber | [block]1; [plot]1; [plant]BS5; [replicate]3 |
| Corkoak Barradas da Serra 06 | Quercus suber | NCBITaxon | http://purl.bioontology.org/ontology/NCBITAXON/58331 | Quercus | suber | INIAV:BS06 | Growth | BS6 | plantnumber | [block]1; [plot]1; [plant]BS6; [replicate]4 |

| Source Name | Characteristics[Organism] | Term Source REF | Term Accession Number | Characteristics[Genus] | Characteristics[Species] | Characteristics[Material Source ID] | Characteristics[Material Source DOI] | Protocol REF | Sample Name | Characteristics[Observation Unit Type] | Characteristics[Spatial Distribution] | Factor Value[fruit load] |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| S. lycopersicum cv. M82 | Solanum lycopersicum | NCBITaxon | http://purl.bioontology.org/ontology/NCBITAXON/4081 | Solanum | lycopersicum | EA10004 | https://www.eu-sol.wur.nl/rdf/accession/EA10004 | Growth | 29301054 | plant | [X]1054; [plot]0; [plant]29301054; [replicate]1 | low (pruned till one fruit) |
| S. lycopersicum cv. M82 | Solanum lycopersicum | NCBITaxon | http://purl.bioontology.org/ontology/NCBITAXON/4081 | Solanum | lycopersicum | EA10004 | https://www.eu-sol.wur.nl/rdf/accession/EA10004 | Growth | 29301618 | plant | [X]1618; [plot]0; [plant]29301618; [replicate]1 | low (pruned till one fruit) |
| S. lycopersicum cv. M82 | Solanum lycopersicum | NCBITaxon | http://purl.bioontology.org/ontology/NCBITAXON/4081 | Solanum | lycopersicum | EA10004 | https://www.eu-sol.wur.nl/rdf/accession/EA10004 | Growth | 29301030 | plant | [X]1030; [plot]0; [plant]29301030; [replicate]1 | low (pruned till one fruit) |
| S. lycopersicum cv. M82 | Solanum lycopersicum | NCBITaxon | http://purl.bioontology.org/ontology/NCBITAXON/4081 | Solanum | lycopersicum | EA10004 | https://www.eu-sol.wur.nl/rdf/accession/EA10004 | Growth | 29302127 | plant | [X]2127; [plot]0; [plant]29302127; [replicate]1 | low (pruned till one fruit) |

Generated with:

    python brapi_to_isa.py -e https://brapi.biodata.pt/brapi/v1/ -t 2
    python brapi_to_isa.py -e https://www.eu-sol.wur.nl/webapi/tomato/brapi/v1/ -t 2

proccaserra commented 5 years ago

@bedroesb nice one, you beat me to it! I was about to push the changes. Relates to https://github.com/MIAPPE/ISA-Tab-for-plant-phenotyping/issues/17 @PapoutsoglouE

bedroesb commented 5 years ago

Well it was a small thingy so sorry about that ;) I don't really see a problem with the if block to be honest.

I know that I need to change the block when taxonIDs are given through BrAPI.

proccaserra commented 5 years ago

@bedroesb for the if block, in order to be consistent, I think we need to make sure we use a similar pattern, so:

    if 'taxonId' in all_germplasm_attributes and all_germplasm_attributes['taxonId']:
        taxonids = []
        organism = "multiple organisms"
        ncbitaxon = OntologySource(name='NCBITaxon', description="NCBI Taxonomy")
        # Build one "<source>:<id>" entry per taxon ID, defaulting the source to NCBI
        for taxonid in all_germplasm_attributes['taxonId']:
            taxonids.append(att_test(taxonid, 'sourceName', 'NCBI') + ":" + str(taxonid['taxonId']))
        c = self.create_isa_characteristic('Organism', organism, ';'.join(taxonids), ncbitaxon.name, ';'.join(taxonids))
        returned_characteristics.append(c)

Sorry, I didn't test this.

bedroesb commented 5 years ago

The attribute taxonIds looks like this:

        "taxonIds": [
            {
                "sourceName": "ncbiTaxon",
                "taxonId": "2340"
            },
            {
                "sourceName": "ciradTaxon",
                "taxonId": "E312"
            }
        ],

So the problem is how to handle the URI when it is not an NCBI taxon.

If I assume it is always an NCBI taxon ID, then it is an easy thing to implement indeed.

bedroesb commented 5 years ago

I guess we can just look for a sourceName == ncbiTaxon and then take the value delivered by taxonId; otherwise we fall back to the existing implementation (using the genus and species).
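A minimal sketch of that lookup, assuming the `taxonIds` list has the shape shown above. The function name `pick_ncbi_taxon_id` and the case-insensitive match on `sourceName` are illustrative assumptions, not the code in brapi_to_isa.py:

```python
from typing import Optional


def pick_ncbi_taxon_id(taxon_ids: Optional[list]) -> Optional[str]:
    """Return the first taxonId whose sourceName is 'ncbiTaxon', else None.

    Hypothetical helper: the case-insensitive comparison is an assumption,
    since endpoints may spell the source name differently.
    """
    for entry in taxon_ids or []:
        source = str(entry.get("sourceName", "")).lower()
        if source == "ncbitaxon" and entry.get("taxonId"):
            return str(entry["taxonId"])
    # No NCBI entry: the caller falls back to genus and species.
    return None


taxon_ids = [
    {"sourceName": "ncbiTaxon", "taxonId": "2340"},
    {"sourceName": "ciradTaxon", "taxonId": "E312"},
]
print(pick_ncbi_taxon_id(taxon_ids))
```

When the function returns `None`, the caller would keep using the genus and species lookup.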

proccaserra commented 5 years ago

Right, but I can't remember off the top of my head whether that situation (multiple taxonIds) occurs when there is one species+genus and the multiple taxonIds are a listing of 'alternate identifiers' for the same organism, or whether it defines a hybrid organism, where it is necessary to list all the different taxa from the parent lines.

Either way, concatenating the multiple entries will not necessarily be pretty in a tabular format.

bedroesb commented 5 years ago

true that, I am changing it

bedroesb commented 5 years ago

@proccaserra I've made a new function to make things more logical.

I will add some documentation to it

bedroesb commented 5 years ago

The WUR endpoint delivered the URI link as taxonId, while the Portuguese one gave the NCBI ID itself, but both cases are handled in the script now.
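One way such a difference could be handled is sketched below. It assumes that a URI-style taxonId ends in the numeric NCBI ID (as in the purl.bioontology.org links above); `normalise_ncbi_taxon_id` is a hypothetical name and not necessarily what the script does:

```python
import re


def normalise_ncbi_taxon_id(raw) -> str:
    """Return the numeric NCBI ID from a bare ID or from a URI ending in it.

    Assumption: the numeric ID is the trailing run of digits of the value.
    """
    match = re.search(r"(\d+)/?$", str(raw).strip())
    if match is None:
        raise ValueError(f"no numeric taxon ID found in {raw!r}")
    return match.group(1)


# A bare ID and a URI-style value both reduce to the numeric ID.
print(normalise_ncbi_taxon_id("58331"))
print(normalise_ncbi_taxon_id("http://purl.bioontology.org/ontology/NCBITAXON/4081"))
```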

PapoutsoglouE commented 5 years ago

Take another look at the crosslinked issue on the MIAPPE side. I am not sure that this is the best option, so let's still consider some alternatives!

bedroesb commented 5 years ago

So you propose an extra column called Characteristics[NCBI] with the NCBI ID? That is no problem at all to implement.

bedroesb commented 5 years ago

Does this look like good output?

VIB:

| Source Name | Characteristics[NCBI] | Term Source REF | Term Accession Number | Characteristics[Organism] | Characteristics[Genus] | Characteristics[Species] | Protocol REF | Sample Name | Characteristics[Observation Unit Type] | Characteristics[Spatial Distribution] | Factor Value[water regimen]0 | Factor Value[water regimen]1 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| OE-2-1 | Arabidopsis thaliana | NCBITaxon | http://purl.bioontology.org/ontology/NCBITAXON/3702 | NCBI:3702 | Arabidopsis | thaliana | Growth | pot_13 | plant | [plant]13 | jobau_wellwatered_3-9DAS | jobau_wellwatered_3-9DAS |
| OE-2-1 | Arabidopsis thaliana | NCBITaxon | http://purl.bioontology.org/ontology/NCBITAXON/3702 | NCBI:3702 | Arabidopsis | thaliana | Growth | pot_27 | plant | [plant]27 | jobau_wellwatered_3-9DAS | jobau_wellwatered_3-9DAS |
| OE-2-1 | Arabidopsis thaliana | NCBITaxon | http://purl.bioontology.org/ontology/NCBITAXON/3702 | NCBI:3702 | Arabidopsis | thaliana | Growth | pot_24 | plant | [plant]24 | jobau_wellwatered_3-9DAS | jobau_wellwatered_3-9DAS |
| OE-2-1 | Arabidopsis thaliana | NCBITaxon | http://purl.bioontology.org/ontology/NCBITAXON/3702 | NCBI:3702 | Arabidopsis | thaliana | Growth | pot_3 | plant | [plant]3 | jobau_wellwatered_3-9DAS | jobau_wellwatered_3-9DAS |
| OE-2-1 | Arabidopsis thaliana | NCBITaxon | http://purl.bioontology.org/ontology/NCBITAXON/3702 | NCBI:3702 | Arabidopsis | thaliana | Growth | pot_17 | plant | [plant]17 | jobau_wellwatered_3-9DAS | jobau_wellwatered_3-9DAS |

PT:

| Source Name | Characteristics[NCBI] | Term Source REF | Term Accession Number | Characteristics[Organism] | Characteristics[Genus] | Characteristics[Species] | Characteristics[Material Source ID] | Protocol REF | Sample Name | Characteristics[Observation Unit Type] | Characteristics[Spatial Distribution] |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Cork oak Barradas daSerra 03 | Quercus suber | NCBITaxon | http://purl.bioontology.org/ontology/NCBITAXON/58331 | NCBI:58331 | Quercus | suber | INIAV:BS03 | Growth | BS3 | plantnumber | [block]1; [plot]1; [plant]BS3; [replicate]1 |
| Corkoak Barradas da Serra 04 | Quercus suber | NCBITaxon | http://purl.bioontology.org/ontology/NCBITAXON/58331 | NCBI:58331 | Quercus | suber | INIAV:BS04 | Growth | BS4 | plantnumber | [block]1; [plot]1; [plant]BS4; [replicate]2 |
| Corkoak Barradas da Serra 05 | Quercus suber | NCBITaxon | http://purl.bioontology.org/ontology/NCBITAXON/58331 | NCBI:58331 | Quercus | suber | INIAV:BS05 | Growth | BS5 | plantnumber | [block]1; [plot]1; [plant]BS5; [replicate]3 |
| Corkoak Barradas da Serra 06 | Quercus suber | NCBITaxon | http://purl.bioontology.org/ontology/NCBITAXON/58331 | NCBI:58331 | Quercus | suber | INIAV:BS06 | Growth | BS6 | plantnumber | [block]1; [plot]1; [plant]BS6; [replicate]4 |
| Corkoak Barradas da Serra 07 | Quercus suber | NCBITaxon | http://purl.bioontology.org/ontology/NCBITAXON/58331 | NCBI:58331 | Quercus | suber | INIAV:BS07 | Growth | BS7 | plantnumber | [block]1; [plot]1; [plant]BS7; [replicate]5 |

WUR:

| Source Name | Characteristics[NCBI] | Term Source REF | Term Accession Number | Characteristics[Organism] | Characteristics[Genus] | Characteristics[Species] | Characteristics[Material Source ID] | Characteristics[Material Source DOI] | Protocol REF | Sample Name | Characteristics[Observation Unit Type] | Characteristics[Spatial Distribution] | Factor Value[fruit load] |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| S. lycopersicum cv. M82 | Solanum lycopersicum | NCBITaxon | http://purl.bioontology.org/ontology/NCBITAXON/4081 | NCBI:4081 | Solanum | lycopersicum | EA10004 | https://www.eu-sol.wur.nl/rdf/accession/EA10004 | Growth | 29302110 | plant | [X]2110; [plot]0; [plant]29302110; [replicate]1 | low (pruned till one fruit) |
| S. lycopersicum cv. M82 | Solanum lycopersicum | NCBITaxon | http://purl.bioontology.org/ontology/NCBITAXON/4081 | NCBI:4081 | Solanum | lycopersicum | EA10004 | https://www.eu-sol.wur.nl/rdf/accession/EA10004 | Growth | 29301054 | plant | [X]1054; [plot]0; [plant]29301054; [replicate]1 | low (pruned till one fruit) |
| S. lycopersicum cv. M82 | Solanum lycopersicum | NCBITaxon | http://purl.bioontology.org/ontology/NCBITAXON/4081 | NCBI:4081 | Solanum | lycopersicum | EA10004 | https://www.eu-sol.wur.nl/rdf/accession/EA10004 | Growth | 29301824 | plant | [X]1824; [plot]0; [plant]29301824; [replicate]1 | low (pruned till one fruit) |
| S. lycopersicum cv. M82 | Solanum lycopersicum | NCBITaxon | http://purl.bioontology.org/ontology/NCBITAXON/4081 | NCBI:4081 | Solanum | lycopersicum | EA10004 | https://www.eu-sol.wur.nl/rdf/accession/EA10004 | Growth | 29302127 | plant | [X]2127; [plot]0; [plant]29302127; [replicate]1 | low (pruned till one fruit) |
| S. lycopersicum cv. M82 | Solanum lycopersicum | NCBITaxon | http://purl.bioontology.org/ontology/NCBITAXON/4081 | NCBI:4081 | Solanum | lycopersicum | EA10004 | https://www.eu-sol.wur.nl/rdf/accession/EA10004 | Growth | 29301317 | plant | [X]1317; [plot]0; [plant]29301317; [replicate]1 | low (pruned till one fruit) |

@PapoutsoglouE

DanFaria commented 5 years ago

Please check my post on the related issue on the MIAPPE GitHub: https://github.com/MIAPPE/ISA-Tab-for-plant-phenotyping/issues/17#issuecomment-524937373

If the goal is for BrAPI2ISA to generate MIAPPE-compliant ISA-Tab, then what I said there holds here as well. We should not be modeling Organism in a way that differs from the MIAPPE 1.1 checklist, even if that means we cannot use some of the functionalities from ISA.

bedroesb commented 5 years ago

@DanFaria
So if I am following correctly, it will stay the same as it was (so without the Characteristics[NCBI], Term Source REF and Term Accession Number columns), but with NCBITAXON:xxxx instead of NCBI:xxxx for the Characteristics[Organism] column.
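As a small illustration of that proposal, the sketch below builds the `NCBITAXON:xxxx` value and maps it back to the purl.bioontology.org URL pattern seen in the earlier output. Both function names are hypothetical:

```python
# URL prefix taken from the Term Accession Number values in the earlier output.
NCBITAXON_PURL = "http://purl.bioontology.org/ontology/NCBITAXON/"


def organism_curie(taxon_id) -> str:
    """Format a numeric NCBI taxon ID as the proposed Characteristics[Organism] value."""
    return f"NCBITAXON:{taxon_id}"


def curie_to_purl(curie: str) -> str:
    """Turn 'NCBITAXON:<id>' back into the corresponding purl URL."""
    prefix, _, local_id = curie.partition(":")
    if prefix != "NCBITAXON" or not local_id:
        raise ValueError(f"not an NCBITAXON identifier: {curie!r}")
    return NCBITAXON_PURL + local_id


print(organism_curie(58331))
print(curie_to_purl("NCBITAXON:4081"))
```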

DanFaria commented 5 years ago

@bedroesb Yes, I think that is the best solution, as I don't see a way to improve functionality on the ISA side without deviating from the MIAPPE checklist. I would give it a couple of days to see if anyone expresses a different opinion on the pending MIAPPE ISA-Tab issue, but after that, I think you can go ahead with that configuration.

Eliana has already posted an issue on the MIAPPE checklist to update the NCBI prefix to NCBITAXON, and hopefully that can still be done within the MIAPPE 1.1 release, as it is a non-functional change.

proccaserra commented 5 years ago

@bedroesb @DanFaria I guess the ambiguity lies in the fact that for the MIAPPE organism field an identifier is expected, where intuitively an organism name would be supplied (following the pattern for Genus and Species).

So maybe a minor change would be to use 'organism ID' in both MIAPPE and the ISA configuration to remove that uncertainty.

DanFaria commented 5 years ago

> So maybe a minor change would be to use 'organism ID' in both MIAPPE and the ISA configuration to remove that uncertainty.

I agree that this would make the field more intuitive. I'll raise the issue on the MIAPPE checklist, and if approved, we can update the ISA configuration.

bedroesb commented 5 years ago

WUR:

| Source Name | Characteristics[Organism] | Characteristics[Genus] | Characteristics[Species] | Characteristics[Material Source ID] | Characteristics[Material Source DOI] | Protocol REF | Sample Name | Characteristics[Observation Unit Type] | Characteristics[Spatial Distribution] | Factor Value[fruit load] |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| S. lycopersicum cv. M82 | NCBITAXON:4081 | Solanum | lycopersicum | EA10004 | https://www.eu-sol.wur.nl/rdf/accession/EA10004 | Growth | 29301824 | plant | X:1824;plot:0;plant:29301824;replicate:1 | low (pruned till one fruit) |
| S. lycopersicum cv. M82 | NCBITAXON:4081 | Solanum | lycopersicum | EA10004 | https://www.eu-sol.wur.nl/rdf/accession/EA10004 | Growth | 29301642 | plant | X:1642;plot:0;plant:29301642;replicate:1 | low (pruned till one fruit) |

PT:

| Source Name | Characteristics[Organism] | Characteristics[Genus] | Characteristics[Species] | Characteristics[Material Source ID] | Protocol REF | Sample Name | Characteristics[Observation Unit Type] | Characteristics[Spatial Distribution] |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Cork oak Barradas daSerra 03 | NCBITAXON:58331 | Quercus | suber | INIAV:BS03 | Growth | BS3 | plantnumber | block:1;plot:1;plant:BS3;replicate:1 |
| Corkoak Barradas da Serra 04 | NCBITAXON:58331 | Quercus | suber | INIAV:BS04 | Growth | BS4 | plantnumber | block:1;plot:1;plant:BS4;replicate:2 |

VIB:

| Source Name | Characteristics[Organism] | Characteristics[Genus] | Characteristics[Species] | Protocol REF | Sample Name | Characteristics[Observation Unit Type] | Characteristics[Spatial Distribution] | Factor Value[water regimen] |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| OE-2-1 | NCBITAXON:3702 | Arabidopsis | thaliana | Growth | pot_10 | plant | plant:10 | jobau_wellwatered_10-21DAS,jobau_wellwatered_3-9DAS |
| OE-2-1 | NCBITAXON:3702 | Arabidopsis | thaliana | Growth | pot_24 | plant | plant:24 | jobau_drought_10-21DAS,jobau_wellwatered_3-9DAS |

The VIB output also includes the fix for the treatments problem mentioned before.

PapoutsoglouE commented 5 years ago

Off the top of my head, I don't recall any of the WUR germplasm having S. lycopersicum in their name/ID. I am also unsure where the cv. M82 came from.
@bedroesb, could you elaborate on how the Source Name is formed in this case? (I may be misremembering and there might indeed be germplasm with that information)

(Also, the format for Spatial Distribution has been changed from using square brackets to colons, i.e. from [block] 1;[plot] 2 to block:1;plot:2.)
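For reference, the change of notation can be sketched as a small conversion from the bracket form to the colon form. The helper `brackets_to_colons` is hypothetical and only illustrates the two formats, not how the script builds the value:

```python
import re


def brackets_to_colons(value: str) -> str:
    """Convert '[block]1; [plot]2' style strings to 'block:1;plot:2'."""
    # Each pair is a bracketed key followed by its value, up to the next ';'.
    pairs = re.findall(r"\[([^\]]+)\]\s*([^;]*)", value)
    return ";".join(f"{key}:{val.strip()}" for key, val in pairs)


print(brackets_to_colons("[block]1; [plot]1; [plant]BS3; [replicate]1"))
```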

PapoutsoglouE commented 5 years ago

I double checked, and indeed our database has some entries with that germplasm name. Apologies!

bedroesb commented 5 years ago

No problem! I just updated the examples in my previous post with the latest code changes concerning Characteristics[Spatial Distribution]