compomics / compomics-utilities

Open source Java library for computational proteomics
http://compomics.github.io/projects/compomics-utilities.html
29 stars 17 forks source link

Filter by taxonomy #20

Closed ypriverol closed 6 years ago

ypriverol commented 7 years ago

@hbarsnes I want to use the peptide mapper for all the datasets in PRIDE that the protein do not exist anymore in UniProt. I would like to use the following way programmatically:

Peptide Sequence , Taxonomy -> Protein sequences where that contains this peptide. Can you provide me an example?

Regards Yasset

hbarsnes commented 7 years ago

Hi Yasset,

I think this questions would be better addressed at @dominik-kopczynski, he should be able to easily give you the required code examples. I was assuming that this was included in the PeptideMapper wiki, but this does not seem the case. So we should add it there as well. @dominik-kopczynski I assume you can take care of updating the wiki as well?

In the meantime you may get some ideas on how to do what you want based on the following class. This shows how the mapper is used via the command line, but if you look at the code it should show you what you need to do in your own code (and is perhaps easier to understand than if going into the PeptideShaker code).

BTW, I assume that you by taxonomy mean a FASTA file right? As we cannot automatically map to a taxonomy. I know that @mvaudel wants to look into such an approach, but this is currently only at the idea stage as far as I know.

Best regards, Harald

ypriverol commented 7 years ago

@hbarsnes the taxonomy means that some fasta file contains the information of the taxonomy in the description (for example uniprot). Would be nice if this information is available at protein level, been able to filter by that.

I will take a look to the example.

hbarsnes commented 7 years ago

@ypriverol You mean as part of the protein sequence headers? As in having a FASTA file that contains both SwissProt and TrEMBL sequences and you would then like to, for example, tell the peptide mapper to only map to the SwissProt entries? If yes, wouldn't it be much easier to make a FASTA file containing only the SwissProt entries and map to that one directly?

mvaudel commented 7 years ago

Hi Yasset,

I think the easiest way to do that is to index all Uniprot, query a peptide sequence, get all protein accessions, and extract the taxonomy of every protein from the header. Are you doing that with utilities as dependency? I can write you an example in the wiki if you like :)

Best regards,

Marc

dominik-kopczynski commented 7 years ago

Hi Yasset,

Here is my suggestion for a solution for you. Currently, we do not specify a taxonomy for the search, hence we suggest a filtering of the results. Here is some code snippet. If you have problems to integrate it into your code, please let me know.

Cheers, Dominik

void myfunction(){ // loading all input data ArrayList peptides = new ArrayList();

// [fill peptides list]
File fastaFile = new File([FASTA FILE]);
SequenceFactory sequenceFactory = SequenceFactory.getInstance();
try {
    sequenceFactory.loadFastaFile(fastaFile, waitingHandlerCLIImpl);
} catch (Exception e) {
    System.err.println("Error: cound not open FASTA file");
    System.exit(-1);
}

ArrayList<PeptideProteinMapping> allPeptideProteinMappings = new ArrayList<PeptideProteinMapping>();

// initializing all necessary parameters for the mapping
WaitingHandlerCLIImpl waitingHandlerCLIImpl = new WaitingHandlerCLIImpl();
SearchParameters searchParameters = new SearchParameters();
searchParameters.setPtmSettings(new PtmSettings());
searchParameters.setFragmentIonAccuracy(0.02);
searchParameters.setFragmentAccuracyType(SearchParameters.MassAccuracyType.DA);

sequenceMatchingPreferences = new SequenceMatchingPreferences();
sequenceMatchingPreferences.setSequenceMatchingType(SequenceMatchingPreferences.MatchingType.indistiguishableAminoAcids);
sequenceMatchingPreferences.setLimitX(0.25);

PeptideVariantsPreferences peptideVariantsPreferences = PeptideVariantsPreferences.getNoVariantPreferences();

// initializing the mapper
PeptideMapper peptideMapper = new FMIndex(waitingHandlerCLIImpl, true, peptideVariantsPreferences, searchParameters);

// start mapping
for (int i = 0; i < peptides.size(); ++i){

    ArrayList<PeptideProteinMapping> peptideProteinMappings = peptideMapper.getProteinMapping(peptides.get(i), sequenceMatchingPreferences);
    for (int j = 0; j < peptideProteinMappings.size(); ++j){

        String accession = peptideProteinMappings.get(j).getProteinAccession();
        // taxonomy filtering
        if (sequenceFactory.getHeader(accession).getTaxonomy().contains([DESIRED TAXONOMY e.g. 'Homo sapiens']){
            allPeptideProteinMappings.add(peptideProteinMappings.get(j));
        }

    }

}

}

ypriverol commented 7 years ago

@dominik-kopczynski Thanks I will try in my side and let you know.

Regards Yasset

hbarsnes commented 7 years ago

@ypriverol Any progress on this one? Can we close the issue?