Closed graceleuchtenberger closed 1 month ago
Here's my blast code.
Can you please provide links to the various files involved here? Also, can you please provide the output from the head
command for those same files?
That will be immensely helpful in getting the layout d the land.
Can you please provide links to the various files involved here?
In addition to DESeq output / count matrix.. also indicate what fasta files are available- (also showing their format eg head)
and just so we do not get to off course... code for the alignment used for Deseq2?
I think the cds and fasta files are too big to push to GitHub (on the order of up to 272 MB), but if that assumption is incorrect let me know and I will take them out of gitignore. I've put the file paths here in case you want to use them in Raven.
Head of the fasta I used to make my blastdb, which is all of Swissprot plus some sequences for unreviewed Mytilus foot proteins: File path: /home/shared/8TB_HDD_02/graceleuchtenberger/Github/byssus-exp-analysis/data/swissprot_n_mytilusfoot.fasta
Head of the gene count matrix: Link here
Head of the cds file used as query file for blast: File path: /home/shared/8TB_HDD_02/graceleuchtenberger/Github/byssus-exp-analysis/data/ncbi_dataset/data/GCF_036588685.1/cds_from_genomic.fasta
Here's more of the code: HiSat2 alignment that Steven did
As I'm looking back at that HiSat2 code, I think Steven just used a different fasta file for HiSat (GCF_036588685.1_PNRI_Mtr1.1.1.hap1_genomic.fna) than I did for running the blast. Perhaps that's where I should start. Just missed that when thinking about the problem.
Thanks. Also, could you please provide link/head
to BLAST output?
As I'm looking back at that HiSat2 code, I think Steven just used a different fasta file for HiSat (GCF_036588685.1_PNRI_Mtr1.1.1.hap1_genomic.fna) than I did for running the blast. Perhaps that's where I should start.
This seems like it.
However, can you please repeat (and/or rephrase) what you're trying to accomplish?
The issue title suggests you're trying to do some ID mapping, but in your initial post you also said
so that way I can annotate my significantly expressed genes?
If you're trying to do the latter, could you just BLAST those genes against SwissProt in order to get annotations?
I think my original thinking was that Steven and I had used the same fasta file for the indexing and BLAST, so I was trying to do some ID mapping to deal with just the differences in what showed up in the geneID or transcriptID column. But now that I know Steven just used a different file, I'll just re-run the blast using the file he used which will most likely generate the same ID's, making it easier to match my annotation table to my significant genes tables.
@graceleuchtenberger what will you blast?
Show us the blast output you have, and the head of any other fasta like an rna.fa file.
Here's the head of the blast output file after using the cds file as the query:
Here's the head of the fna file used for HiSat that I think I should use for the annotation blast:
Here's the head of the fna file used for HiSat that I think I should use for the annotation blast:
No that it the genome. you should not blast that.
Here is a analogous issue and solution: https://robertslab.github.io/tusk/modules/04-blast.html#navigating-annotation
can elaborate on more later.
This is super helpful, thank you!
Here is an attempt to try to sketch out challenges... and solutions https://sr320.github.io/tumbling-oysters/posts/30-hisat-annot/
Thank you, this is a great explanation! Thankfully, the mussel genome has a cds file with the gene ID locus numbers in brackets, like you said at the end of the video, so I used your code from the solution you linked and it worked like a charm!
For the byssus genetic markers project that Matt George started, I'm trying to get GO terms and gene ID's for differentially expressed genes that I got through DESeq.
I am in the process of annotating the cds file that we used to index the tag-seq data, but I'm running into a bit of an issue. In my gene count matrix (that I used for DESeq), the listed transcripts with their copy numbers are specified as gene locus ID's (not sure if I'm saying that right), but they look like this: gene-LOC134696364|LOC134696364 Other times it looks like they have an actual idea of the gene: gene-Trnam-cau-162|Trnam-cau
I want to be able to left join my annotation table to my tables of significant genes (originally derived from the gene count matrix using the same identifiers) by the gene ID or transcript ID, but the output I got from blasting my genome against the reviewed UniProt database has identifiers that look like this: lcl|NC_086373.1_cds_XP_063416577.1_1.
So it connects back to the accession number of the overall cds file (the NC number) and gives me a protein ID, but doesn't give me a corresponding location on the genome like HiSat did when Steven used the cds file to index the tag-seq data.
From some googling, I've seen that there are ways to map protein ID's to gene ID's, but these are just loci on a genome that don't necessarily have associated ID's.
I guess my question is, is there a way to get blast to spit out the same loci that HiSat did, so that way I can annotate my significantly expressed genes? Or is there a way to get around my problem in a different way? Thought I'd post an issue in the likely case someone has dealt with this before.