Advice on mapping gene locus ID's to protein ID's?

graceleuchtenberger commented 1 month ago

For the byssus genetic markers project that Matt George started, I'm trying to get GO terms and gene ID's for differentially expressed genes that I got through DESeq.

I am in the process of annotating the cds file that we used to index the tag-seq data, but I'm running into a bit of an issue. In my gene count matrix (that I used for DESeq), the listed transcripts with their copy numbers are specified as gene locus ID's (not sure if I'm saying that right), but they look like this: gene-LOC134696364|LOC134696364 Other times it looks like they have an actual idea of the gene: gene-Trnam-cau-162|Trnam-cau

I want to be able to left join my annotation table to my tables of significant genes (originally derived from the gene count matrix using the same identifiers) by the gene ID or transcript ID, but the output I got from blasting my genome against the reviewed UniProt database has identifiers that look like this: lcl|NC_086373.1_cds_XP_063416577.1_1.

So it connects back to the accession number of the overall cds file (the NC number) and gives me a protein ID, but doesn't give me a corresponding location on the genome like HiSat did when Steven used the cds file to index the tag-seq data.

From some googling, I've seen that there are ways to map protein ID's to gene ID's, but these are just loci on a genome that don't necessarily have associated ID's.

I guess my question is, is there a way to get blast to spit out the same loci that HiSat did, so that way I can annotate my significantly expressed genes? Or is there a way to get around my problem in a different way? Thought I'd post an issue in the likely case someone has dealt with this before.

graceleuchtenberger commented 1 month ago

Here's my blast code.

kubu4 commented 1 month ago

Can you please provide links to the various files involved here? Also, can you please provide the output from the head command for those same files?

That will be immensely helpful in getting the layout d the land.

sr320 commented 1 month ago

Can you please provide links to the various files involved here?

In addition to DESeq output / count matrix.. also indicate what fasta files are available- (also showing their format eg head)

sr320 commented 1 month ago

and just so we do not get to off course... code for the alignment used for Deseq2?

graceleuchtenberger commented 1 month ago

I think the cds and fasta files are too big to push to GitHub (on the order of up to 272 MB), but if that assumption is incorrect let me know and I will take them out of gitignore. I've put the file paths here in case you want to use them in Raven.

Head of the fasta I used to make my blastdb, which is all of Swissprot plus some sequences for unreviewed Mytilus foot proteins: Screenshot 2024-10-02 at 10 16 08 AM File path: /home/shared/8TB_HDD_02/graceleuchtenberger/Github/byssus-exp-analysis/data/swissprot_n_mytilusfoot.fasta

Head of the gene count matrix: Screenshot 2024-10-01 at 4 55 22 PM Link here

Head of the cds file used as query file for blast: Screenshot 2024-10-02 at 10 19 41 AM File path: /home/shared/8TB_HDD_02/graceleuchtenberger/Github/byssus-exp-analysis/data/ncbi_dataset/data/GCF_036588685.1/cds_from_genomic.fasta

Here's more of the code: HiSat2 alignment that Steven did

As I'm looking back at that HiSat2 code, I think Steven just used a different fasta file for HiSat (GCF_036588685.1_PNRI_Mtr1.1.1.hap1_genomic.fna) than I did for running the blast. Perhaps that's where I should start. Just missed that when thinking about the problem.

kubu4 commented 1 month ago

Thanks. Also, could you please provide link/head to BLAST output?

As I'm looking back at that HiSat2 code, I think Steven just used a different fasta file for HiSat (GCF_036588685.1_PNRI_Mtr1.1.1.hap1_genomic.fna) than I did for running the blast. Perhaps that's where I should start.

This seems like it.

However, can you please repeat (and/or rephrase) what you're trying to accomplish?

The issue title suggests you're trying to do some ID mapping, but in your initial post you also said

so that way I can annotate my significantly expressed genes?

If you're trying to do the latter, could you just BLAST those genes against SwissProt in order to get annotations?

graceleuchtenberger commented 1 month ago

I think my original thinking was that Steven and I had used the same fasta file for the indexing and BLAST, so I was trying to do some ID mapping to deal with just the differences in what showed up in the geneID or transcriptID column. But now that I know Steven just used a different file, I'll just re-run the blast using the file he used which will most likely generate the same ID's, making it easier to match my annotation table to my significant genes tables.

sr320 commented 1 month ago

@graceleuchtenberger what will you blast?

sr320 commented 1 month ago

Show us the blast output you have, and the head of any other fasta like an rna.fa file.

graceleuchtenberger commented 1 month ago

Here's the head of the blast output file after using the cds file as the query: Screenshot 2024-10-02 at 11 16 50 AM

Here's the head of the fna file used for HiSat that I think I should use for the annotation blast: Screenshot 2024-10-02 at 11 18 34 AM

sr320 commented 1 month ago

Here's the head of the fna file used for HiSat that I think I should use for the annotation blast:

No that it the genome. you should not blast that.

sr320 commented 1 month ago

Here is a analogous issue and solution: https://robertslab.github.io/tusk/modules/04-blast.html#navigating-annotation

can elaborate on more later.

graceleuchtenberger commented 1 month ago

This is super helpful, thank you!

sr320 commented 1 month ago

Here is an attempt to try to sketch out challenges... and solutions https://sr320.github.io/tumbling-oysters/posts/30-hisat-annot/

graceleuchtenberger commented 1 month ago

Thank you, this is a great explanation! Thankfully, the mussel genome has a cds file with the gene ID locus numbers in brackets, like you said at the end of the video, so I used your code from the solution you linked and it worked like a charm!

RobertsLab / resources

Advice on mapping gene locus ID's to protein ID's? #1969