Arcadia-Science / ProteinCartography

a pipeline to build similarity maps of protein space
MIT License

How do we know which additional uniprot metadata features can be fetched by `fetch_uniprot_metadata.py`? #90

Open taylorreiter opened 2 weeks ago

taylorreiter commented 2 weeks ago

Description of feature

I see that the following code is used to fetch uniprot metadata fields.

        python ProteinCartography/fetch_uniprot_metadata.py \
            --input {input} \
            --output {output.uniprot_features} \
            --additional-fields {UNIPROT_ADDITIONAL_FIELDS}

Would you be willing to document how to provide additional fields?

I see in the script that:

REQUIRED_FIELDS_DICT = {
    "Entry": "accession",
    "Entry Name": "id",
    "Protein names": "protein_name",
    "Gene Names (primary)": "gene_primary",
    "Annotation": "annotation_score",
    "Organism": "organism_name",
    "Taxonomic lineage": "lineage",
    "Length": "length",
    "Fragment": "fragment",
    "Sequence": "sequence",
}
OTHER_FIELDS_DICT = {
    "Reviewed": "reviewed",
    "Gene Names": "gene_names",
    "Protein existence": "protein_existence",
    "Sequence version": "sequence_version",
    "RefSeq": "xref_refseq",
    "GeneID": "xref_geneid",
    "EMBL": "xref_embl",
    "AlphaFoldDB": "xref_alphafolddb",
    "PDB": "xref_pdb",
    "Pfam": "xref_pfam",
    "InterPro": "xref_interpro",
}

but I'm not sure:

braebigge commented 2 weeks ago

Thanks for pointing this out, Taylor! We'll add more documentation to cover these questions.

In the meantime, all of the available fields can be found here. I tried it out for the signal peptide field, ft_signal, using the following command line argument:

ProteinCartography/fetch_uniprot_metadata.py -i test/proteins.txt -o test/output/uniprot_features.tsv -a ft_signal

To request two or more fields, separate them with commas (but no spaces), like this:

ProteinCartography/fetch_uniprot_metadata.py -i test/proteins.txt -o test/output/uniprot_features.tsv -a ft_signal,ft_act_site,ft_transmem

Both of these commands worked for me, so I assume you can use any fields that UniProt has to offer, but let me know if you run into any that give you errors!
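As a rough illustration of the comma-separated format above, here is a minimal Python sketch of how a value passed to `-a` could be split and combined with the required fields. The function names `parse_additional_fields` and `build_fields_param` are hypothetical and are not taken from `fetch_uniprot_metadata.py`; only the required field names are copied from the values of `REQUIRED_FIELDS_DICT` quoted earlier in this thread. The script's actual handling may differ.

```python
# Field names copied from the values of REQUIRED_FIELDS_DICT quoted in this issue.
REQUIRED_FIELDS = [
    "accession", "id", "protein_name", "gene_primary", "annotation_score",
    "organism_name", "lineage", "length", "fragment", "sequence",
]


def parse_additional_fields(arg):
    """Split a comma-separated -a value into a list of field names.

    Hypothetical helper for illustration; not the script's own code.
    """
    if not arg:
        return []
    # Drop surrounding whitespace and empty items (e.g. from a trailing comma).
    return [name.strip() for name in arg.split(",") if name.strip()]


def build_fields_param(additional):
    """Join required and additional fields, keeping order and dropping duplicates."""
    seen = set()
    ordered = []
    for name in REQUIRED_FIELDS + parse_additional_fields(additional):
        if name not in seen:
            seen.add(name)
            ordered.append(name)
    return ",".join(ordered)


if __name__ == "__main__":
    # Same three fields as in the second command above.
    print(build_fields_param("ft_signal,ft_act_site,ft_transmem"))
```

The "no space" advice matters mostly at the shell level: an unquoted space would split the value into two separate arguments before any Python code sees it.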

taylorreiter commented 2 weeks ago

Thanks so much @braebigge! This looks perfect, but I'll let you know if any edge cases come up for me.