Blast results not all inclusive

meganewing commented 3 months ago

Running blast, but not all of the LOC# are getting hits for anything -- just a list of NA's. I have 47 DEGs, but only 34 of them have matches when I go to join my blast results. (joined blast and DEG results here)

When I sanity checked by looking at the genome viewer / annotated genes on ncbi, the LOC# (or "symbols") that are returning NA's all have matches except for 2 which are categorized as "uncharacterized".

I made my e value less stringent (from 1e-20 to 1e-5) to try an see if that was it, to no avail.

Any clues as to whats going wrong here?

sr320 commented 3 months ago

34/47 blast hits is expected in species such as this.

Note you will not get NAs with blast alone (just no return).

If they are annotated on ncbi - why would you need to blast?

meganewing commented 3 months ago

"Note you will not get NAs with blast alone (just no return)." -- Can you clarify what you mean by this?

Also I suppose that makes sense to not need to use blast if they are annotated. I still want an outcome dataframe that has basically all of the info blast would give me for each of my DEG's. What would your guidance be on this? Is there a way to download the annotation table?

meganewing commented 3 months ago

Also, for my own curiosity's sake, why is it that 34/47 hits is expected? Why would it not be all 47?

sr320 commented 3 months ago

"Note you will not get NAs with blast alone (just no return)." -- Can you clarify what you mean by this?

Paste blast results here with NAs.. I could be wrong.

What would your guidance be on this?

Merge your blast results with NCBI data (great cross-check as most should match)

is there a way to download the annotation table?

yes - https://d.pr/i/NOKlBs

why is it that 34/47 hits is expected? Why would it not be all 47?

It is not uncommon to have almost have of genes not have SP blast hits depending on the species (and evalue)

meganewing commented 3 months ago

Thanks!

meganewing commented 3 months ago

Okay I may have closed the issue prematurely.

I've never joined the DEG with an already fully annotated gene list before. How should I proceed towards UniProt and getting GO terms since there's no Uniprot accession ID? I tried using the Gene id and accession number that were included in the annotation file, but yielded no results on uniprot even when switching the 'from' data base to refseq.

Any guidance appreciated -- thank you!

sr320 commented 3 months ago

Provide url to notebook posts on analysis done so far.

On Mon, Aug 26, 2024 at 7:29 PM Megan Ewing @.***> wrote:

Okay I may have closed the issue prematurely.

I've never joined the DEG with an already fully annotated gene list before. How should I proceed towards UniProt and getting GO terms since there's no Uniprot accession ID? I tried using the Gene id and accession number that were included in the annotation file, but yielded no results on uniprot even when switching the 'from' data base to refseq.

Any guidance appreciated -- thank you!

— Reply to this email directly, view it on GitHub https://urldefense.com/v3/__https://github.com/RobertsLab/resources/issues/1953*issuecomment-2311448517__;Iw!!K-Hz7m0Vt54!jRLXs-v3YvmqnZqXRSVgs5foEUnWlzrW7Y5MWTqEgKqRrHoUsDk_i-C8SxM84ObYba2jSbjuNLLYkd9mgF3qrhs$, or unsubscribe https://urldefense.com/v3/__https://github.com/notifications/unsubscribe-auth/ABB4PN64SJYHM62AOR3K6RDZTPQBTAVCNFSM6AAAAABNETS72KVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDGMJRGQ2DQNJRG4__;!!K-Hz7m0Vt54!jRLXs-v3YvmqnZqXRSVgs5foEUnWlzrW7Y5MWTqEgKqRrHoUsDk_i-C8SxM84ObYba2jSbjuNLLYkd9mPZiLDi0$ . You are receiving this because you commented.Message ID: @.***>

sr320 commented 3 months ago

And be sure it is clear in the post what you are attempting to do and what the specific challenge is - important to show portion of tables (head) to indicate problem

meganewing commented 2 months ago

https://meganewing.github.io/mewing-notebook/posts/2024-08/2024-08.html#august-26th

kubu4 commented 2 months ago

I think you need to provide a more granular breakdown of your various analyses in your notebook so we can see what every step looks like. I know you've used the head() in your R Markdown scripts, but we can't see the output from that command.

You notebook should have some of the following, for example:

Provide preview (head) of BLAST output table.
Provide preview of your query file(s) for BLASTing.
Provide preview (head) of 0807-DEGstats_ToC_8ind23r.tab.
Provide preview (head) of ncbi_dataset.tsv.

Including step-wise previews of the various tables/files will greatly help.

Also, a clear statement on what data you're starting with and what you want for an end result will help.

meganewing commented 2 months ago

How's this? : https://meganewing.github.io/mewing-notebook/posts/2024-08/2024-08.html#august-28th

kubu4 commented 2 months ago

Pretty close.

You forgot this key point, though:

Also, a clear statement on what data you're starting with and what you want for an end result will help.

kubu4 commented 2 months ago

I also saw this in your post:

my computer doesn’t work with the VPN.

Create an issue (or reopen???) to try to get this addressed (again?).

kubu4 commented 2 months ago

Oh, and to elaborate on this:

Also, a clear statement on what data you're starting with and what you want for an end result will help.

I have absolutely no idea what you're working on (as is probably the case with other people who might be able to help). So, a more general overview describing what you're trying to accomplish would be immensely helpful.

meganewing commented 2 months ago

Pretty close.

You forgot this key point, though:

Also, a clear statement on what data you're starting with and what you want for an end result will help.

This should be at the bottom of the post! I'll update the notebook again in a sec but its Control v. Treatment manila clam rna seq data.

Also regarding the VPN.. I went to the SAFS tech guy and he said its because it's they updated the security for it and I need ios ventura (or whatever v13+ is), which is not available on any macs before 2017 (and i have a 2016). I've making it work by doing what I can remotely, and then goin in person when needed.

kubu4 commented 2 months ago

The R Pubs link is really what we've wanted. This is a good start!

https://rpubs.com/mewing0/1214571

I'd say that your R Pubs (and/or your R Markdown and/or your notebook), should have more text explaining what you're seeing in files and describing how/why you're taking the next steps.

For example, in the first step of the R Pubs file, you state:

Read in DEG count and stats files

Explain why.

In my mind, that's my first question. Why would you need read counts for annotations?

Although I think I know why, it will be very helpful for you to write out why you're preforming each step.

Then, for another example where it would be helpful to explain your thought process:

Join them

Explain why you're doing this. Also, write out what you expect the output file to look like and assess (write out) if the output matches your expectations.

Those are just two examples. You should do this for most of the code chunks.

It will be very helpful to you (and us)!

Will it be tedious?

Yep!

Sometimes that's how it goes.

We'll get you through this; don't worry!

Thanks for the info regarding the VPN. We still need to figure out a solution for this, but it's not encouraging that SAFS IT didn't have any suggestions...

meganewing commented 2 months ago

Got it! Thank you for the elaboration and guidance! I'm at work now, but will work on this tonight/tomorrow and send an updated link!

Also yea... I've needed a new computer for a while anyways so I may just bite the bullet on this if it gets too cumbersome. Problem for another day though!

meganewing commented 2 months ago

Okay should be updated now! Let me know your thoughts https://rpubs.com/mewing0/1214571

Thank you!

kubu4 commented 2 months ago

Can you please show us some of the stuff in the chunk starting with:

# read in blast full results
blastfull <- read.csv("../output/0821-rphil_blast_cds.tab", sep="")

It will be helpful to see what's in the those files/dataframes.

meganewing commented 2 months ago

done! Should be same link

kubu4 commented 2 months ago

My preferred method to obtain GO terms via SwissProt IDs is in the Handbook:

https://robertslab.github.io/resources/bio-Annotation/#gene-ontology-go

In your instance, you'll need to extract the SwissProt IDs from blast_id_deg.

Personally, I use awk (in bash) to do this kind of stuff. So, you'd likely need to write blast_id_deg to a file (e.g. blast_id_deg.tab). To use awk, I'd do the following in a bash chunk:

awk -F"|" 'NR > 1 {print $2}' .../output/blast_id_deg.tab | sort --unique > ../output/blast-SPID-unique.txt

That will separate the file using | as a delimiter. It will skip the header line (i.e. the first record NR > 1). Then, it will print the 2nd field ($2) which should be your SwissProt ID.

That's followed by sorting unique values and then writing to a new output file (> ../output/blast-SPID-unique.txt).

REMEMBER: The code I've shown above is an example! You might need to modify it to work with your specific use case(s).

You can then use the ../output/blast-SPID-unique.txt as the input file for the approach outlined in the Handbook link above.

kubu4 commented 2 months ago

However, now that I've typed this all out, this issue is getting derailed. We should close this and open a new issue if you need help with obtaining GO terms (or, anything else that isn't related to your concern about BLAST results)...

meganewing commented 2 months ago

Ah I think there may be a misunderstanding

From blast, I know how to get SPID and where to go from there. The original reason I opened this issue was because I was not getting a SPID for each of my LOC when going the typical blast route... to which i was directed to use the published annotation available... which does not contain SPID...

and so the circle goes.

I just dont understand /why/ there aren't SPIDs available for all of the LOC, even with the genome annotated. Or perhaps theres another way to get GO terms besides SPIDs that I'm unaware of ( i have protein accession ids in the annotated (from genome) DEG file )

does this all make sense?

sr320 commented 2 months ago

I just dont understand /why/ there aren't SPIDs available for all of the LOC, even with the genome annotated. Or perhaps theres another way to get GO terms besides SPIDs that I'm unaware of ( i have protein accession ids in the annotated (from genome) DEG file )

Some genes will not be annotated or simply called "uncharacterized" - genome annotation also refers to describing where genes are. The identity of genes is usually completed by finding similar sequences in other species. In short - some genes you will not know what they are similar to or what to call them. If you do not get a Blast to Swiss-prot, do not expect to get any GO information.

meganewing commented 2 months ago

I must just be having a hard time wrapping my head around this, but even for those that aren't uncharacterized in the annotated genome are not showing up with SPID (eg. titin homolog) ?

meganewing commented 2 months ago

let me see if I understand this correctly,

just because a genome is annotated, does NOT mean that the genes are present/have corresponding IDs within the SP database, even if they are identified within the database the annotation came from (in this case, RefSeq) ?

if this is true, why?

it feels weird just abandoning some of the DEGs, espcially when most of them aren't listed as uncharacterizing, and things like titin homolog are listed for other species on SP

sr320 commented 2 months ago

Outside of code... lets think big picture, what the end goal is. 🏁

You have DEGs you want to describe to explain physiology..

You take your ~50 DEGs - their annotations based on what NCBI provides... you look in the literature regarding what the gene functions are, synthesize.. and you are done!

RobertsLab / resources

Blast results not all inclusive #1953