Closed meganewing closed 2 months ago
34/47 blast hits is expected in species such as this.
Note you will not get NAs with blast alone (just no return).
If they are annotated on ncbi - why would you need to blast?
"Note you will not get NAs with blast alone (just no return)." -- Can you clarify what you mean by this?
Also, I suppose it makes sense not to need BLAST if they are annotated. I still want an outcome dataframe that has basically all of the info BLAST would give me for each of my DEGs. What would your guidance be on this? Is there a way to download the annotation table?
Also, for my own curiosity's sake, why is it that 34/47 hits is expected? Why would it not be all 47?
"Note you will not get NAs with blast alone (just no return)." -- Can you clarify what you mean by this?
Paste blast results here with NAs.. I could be wrong.
What would your guidance be on this?
Merge your blast results with NCBI data (great cross-check as most should match)
is there a way to download the annotation table?
yes - https://d.pr/i/NOKlBs
why is it that 34/47 hits is expected? Why would it not be all 47?
It is not uncommon for almost half of the genes to lack SP BLAST hits, depending on the species (and e-value cutoff).
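To illustrate the point about NAs: BLAST tabular output simply has no row for a query without a hit, so the NAs only appear once the results are joined back onto the DEG list. The sketch below assumes hypothetical file names, gene IDs, and accessions (none of them come from the actual analysis) and a tab-separated BLAST table with the query ID in column 1 and the subject ID in column 2.

```shell
workdir="$(mktemp -d)"

# Hypothetical DEG list: one gene ID per line.
printf '%s\n' LOC0001 LOC0002 LOC0003 > "$workdir/deg_ids.txt"

# Hypothetical BLAST table: LOC0002 had no hit, so it has no row at all.
printf '%s\t%s\t%s\n' \
  LOC0001 'sp|P11111|GENEA_TEST' 1e-50 \
  LOC0003 'sp|Q22222|GENEB_TEST' 3e-12 \
  > "$workdir/blast.tab"

# Left join: load hits keyed by query ID from the first file, then walk the
# DEG list and print NA for any ID that never appeared in the BLAST table.
# If a query has several hits, this keeps only the last one read.
awk -F'\t' -v OFS='\t' '
  NR == FNR { hit[$1] = $2; next }
  { print $1, (($1 in hit) ? hit[$1] : "NA") }
' "$workdir/blast.tab" "$workdir/deg_ids.txt" > "$workdir/deg_blast_joined.tab"

cat "$workdir/deg_blast_joined.tab"

# Count the DEGs that ended up without a hit.
n_missing="$(awk -F'\t' '$2 == "NA" { n++ } END { print n + 0 }' "$workdir/deg_blast_joined.tab")"
echo "DEGs without a hit: $n_missing"
```

The same check on the real tables would show whether the 13 missing DEGs are absent from the BLAST output itself or are being lost at the join step (for example, through mismatched ID formats).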
Thanks!
Okay I may have closed the issue prematurely.
I've never joined the DEGs with an already fully annotated gene list before. How should I proceed towards UniProt and getting GO terms since there's no UniProt accession ID? I tried using the gene ID and accession number that were included in the annotation file, but that yielded no results on UniProt, even when switching the 'from' database to RefSeq.
Any guidance appreciated -- thank you!
Provide url to notebook posts on analysis done so far.
And be sure it is clear in the post what you are attempting to do and what the specific challenge is - important to show portion of tables (head) to indicate problem
I think you need to provide a more granular breakdown of your various analyses in your notebook so we can see what every step looks like. I know you've used head() in your R Markdown scripts, but we can't see the output from that command.
Your notebook should have some of the following, for example:
- head() of BLAST output table.
- head() of 0807-DEGstats_ToC_8ind23r.tab.
- head() of ncbi_dataset.tsv.
Including step-wise previews of the various tables/files will greatly help.
Also, a clear statement on what data you're starting with and what you want for an end result will help.
Pretty close.
You forgot this key point, though:
Also, a clear statement on what data you're starting with and what you want for an end result will help.
I also saw this in your post:
my computer doesn’t work with the VPN.
Create an issue (or reopen???) to try to get this addressed (again?).
Oh, and to elaborate on this:
Also, a clear statement on what data you're starting with and what you want for an end result will help.
I have absolutely no idea what you're working on (as is probably the case with other people who might be able to help). So, a more general overview describing what you're trying to accomplish would be immensely helpful.
Pretty close.
You forgot this key point, though:
Also, a clear statement on what data you're starting with and what you want for an end result will help.
This should be at the bottom of the post! I'll update the notebook again in a sec, but it's Control v. Treatment Manila clam RNA-seq data.
Also regarding the VPN.. I went to the SAFS tech guy and he said it's because they updated the security for it and I need macOS Ventura (or whatever v13+ is), which is not available on any Macs before 2017 (and I have a 2016). I've been making it work by doing what I can remotely, and then going in person when needed.
The R Pubs link is really what we've wanted. This is a good start!
https://rpubs.com/mewing0/1214571
I'd say that your R Pubs (and/or your R Markdown and/or your notebook), should have more text explaining what you're seeing in files and describing how/why you're taking the next steps.
For example, in the first step of the R Pubs file, you state:
Read in DEG count and stats files
Explain why.
In my mind, that's my first question. Why would you need read counts for annotations?
Although I think I know why, it will be very helpful for you to write out why you're performing each step.
Then, for another example where it would be helpful to explain your thought process:
Join them
Explain why you're doing this. Also, write out what you expect the output file to look like and assess (write out) if the output matches your expectations.
Those are just two examples. You should do this for most of the code chunks.
It will be very helpful to you (and us)!
Will it be tedious?
Yep!
Sometimes that's how it goes.
We'll get you through this; don't worry!
Thanks for the info regarding the VPN. We still need to figure out a solution for this, but it's not encouraging that SAFS IT didn't have any suggestions...
Got it! Thank you for the elaboration and guidance! I'm at work now, but will work on this tonight/tomorrow and send an updated link!
Also yea... I've needed a new computer for a while anyways so I may just bite the bullet on this if it gets too cumbersome. Problem for another day though!
Okay should be updated now! Let me know your thoughts https://rpubs.com/mewing0/1214571
Thank you!
Can you please show us some of the stuff in the chunk starting with:
# read in blast full results
blastfull <- read.csv("../output/0821-rphil_blast_cds.tab", sep="")
It will be helpful to see what's in those files/dataframes.
done! Should be same link
My preferred method to obtain GO terms via SwissProt IDs is in the Handbook:
https://robertslab.github.io/resources/bio-Annotation/#gene-ontology-go
In your instance, you'll need to extract the SwissProt IDs from blast_id_deg.
Personally, I use awk (in bash) to do this kind of stuff. So, you'd likely need to write blast_id_deg to a file (e.g. blast_id_deg.tab). To use awk, I'd do the following in a bash chunk:
awk -F"|" 'NR > 1 {print $2}' ../output/blast_id_deg.tab | sort --unique > ../output/blast-SPID-unique.txt
That will separate the file using | as a delimiter. It will skip the header line (i.e. the first record; NR > 1). Then, it will print the 2nd field ($2), which should be your SwissProt ID.
That's followed by sorting unique values and then writing to a new output file (> ../output/blast-SPID-unique.txt).
REMEMBER: The code I've shown above is an example! You might need to modify it to work with your specific use case(s).
You can then use ../output/blast-SPID-unique.txt as the input file for the approach outlined in the Handbook link above.
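For a concrete picture of what the awk one-liner does, here is a minimal sketch run on a tiny made-up table. The gene IDs, accessions, and column layout are assumptions for illustration only; the real blast_id_deg.tab may have different columns, in which case the field number would need adjusting.

```shell
workdir="$(mktemp -d)"

# Hypothetical demo table: a header line plus three BLAST hits, two of which
# share the same SwissProt accession. Subject IDs follow the sp|ACCESSION|NAME
# pattern, so splitting on "|" puts the accession in field 2.
printf '%s\n' \
  'qseqid sseqid evalue' \
  'LOC0001 sp|Q22222|GENEB_TEST 1e-50' \
  'LOC0002 sp|P11111|GENEA_TEST 2e-30' \
  'LOC0003 sp|Q22222|GENEB_TEST 4e-25' \
  > "$workdir/blast_id_deg.tab"

# Split on "|", skip the header (NR > 1), print field 2, then de-duplicate.
awk -F"|" 'NR > 1 {print $2}' "$workdir/blast_id_deg.tab" \
  | sort --unique > "$workdir/blast-SPID-unique.txt"

cat "$workdir/blast-SPID-unique.txt"
# P11111
# Q22222
```

Three hit rows collapse to two unique accessions, which is the form the Handbook approach expects as input.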
However, now that I've typed this all out, this issue is getting derailed. We should close this and open a new issue if you need help with obtaining GO terms (or, anything else that isn't related to your concern about BLAST results)...
Ah I think there may be a misunderstanding
From BLAST, I know how to get a SPID and where to go from there. The original reason I opened this issue was that I was not getting a SPID for each of my LOCs when going the typical BLAST route... at which point I was directed to use the published annotation available... which does not contain SPIDs...
and so the circle goes.
I just don't understand /why/ there aren't SPIDs available for all of the LOCs, even with the genome annotated. Or perhaps there's another way to get GO terms besides SPIDs that I'm unaware of (I have protein accession IDs in the annotated (from genome) DEG file).
does this all make sense?
I just don't understand /why/ there aren't SPIDs available for all of the LOCs, even with the genome annotated. Or perhaps there's another way to get GO terms besides SPIDs that I'm unaware of (I have protein accession IDs in the annotated (from genome) DEG file).
Some genes will not be annotated or will simply be called "uncharacterized". Genome annotation also refers to describing where genes are. The identity of genes is usually established by finding similar sequences in other species. In short, for some genes you will not know what they are similar to or what to call them. If you do not get a BLAST hit to Swiss-Prot, do not expect to get any GO information.
I must just be having a hard time wrapping my head around this, but even those that aren't uncharacterized in the annotated genome are not showing up with a SPID (e.g. titin homolog)?
let me see if I understand this correctly,
just because a genome is annotated does NOT mean that the genes are present/have corresponding IDs within the SP database, even if they are identified within the database the annotation came from (in this case, RefSeq)?
if this is true, why?
it feels weird just abandoning some of the DEGs, especially when most of them aren't listed as uncharacterized, and things like titin homolog are listed for other species on SP
Outside of code... let's think big picture, what the end goal is. 🏁
You have DEGs you want to describe to explain physiology..
You take your ~50 DEGs - their annotations based on what NCBI provides... you look in the literature regarding what the gene functions are, synthesize.. and you are done!
Running BLAST, but not all of the LOC#s are getting hits for anything -- just a list of NAs. I have 47 DEGs, but only 34 of them have matches when I go to join my BLAST results. (joined blast and DEG results here)
When I sanity checked by looking at the genome viewer / annotated genes on NCBI, the LOC#s (or "symbols") that are returning NAs all have matches except for 2, which are categorized as "uncharacterized".
I made my e-value less stringent (from 1e-20 to 1e-5) to try and see if that was it, to no avail.
Any clues as to what's going wrong here?
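One way to check whether the e-value cutoff is really the limiting factor is to count how many distinct queries have at least one hit under each threshold. The sketch below assumes BLAST tabular output in the default outfmt 6 column order (12 tab-separated columns, e-value in column 11); the rows, file name, and helper function name are hypothetical. Note that filtering a saved table can only make the cutoff stricter, so the table must have been produced with the more permissive -evalue setting.

```shell
workdir="$(mktemp -d)"

# Hypothetical outfmt 6 rows: one strong hit and one weak hit.
printf '%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\n' \
  LOC0001 'sp|P11111|GENEA_TEST' 80.0 200 40 0 1 200 1 200 1e-50 300 \
  LOC0002 'sp|Q22222|GENEB_TEST' 35.0 120 78 2 5 124 9 128 2e-08 55 \
  > "$workdir/blast_outfmt6.tab"

# Hypothetical helper: count distinct query IDs (column 1) with at least one
# hit whose e-value (column 11) is at or below the given maximum.
count_queries_with_hits() {
  awk -F'\t' -v max="$1" '
    ($11 + 0) <= (max + 0) { seen[$1] = 1 }
    END { n = 0; for (q in seen) n++; print n }
  ' "$2"
}

strict="$(count_queries_with_hits 1e-20 "$workdir/blast_outfmt6.tab")"
relaxed="$(count_queries_with_hits 1e-5 "$workdir/blast_outfmt6.tab")"
echo "queries with hits at 1e-20: $strict"
echo "queries with hits at 1e-5:  $relaxed"
```

If both counts are identical on the real table, the missing DEGs have no Swiss-Prot hit at either threshold, and the e-value setting is not what is producing the NAs.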