Filter isoform data by e-value with bash

yaaminiv commented 5 years ago

@colleenburge and I are working with a subset of eelgrass transcriptomics data from Trinity. This file was generated with this script.

The Trinity output includes isoform designations, but we only want to work at the gene-level. For genes with multiple isoforms, we want to retain the row with the lowest e-value. @sr320 how can we filter our data this way with bash? Do I need to use a for loop/if then statement, or is there a series of awk commands to do this?

yaaminiv commented 5 years ago

I could imagine an if/then statement in which if the entries in the third column matched for 2 rows, you would retain the row with the lowest e-value. Just can't figure out how to code it...

kubu4 commented 5 years ago

Something like:

sort --unique --key=1,1 _blast-annot-geneOnly.tab

This sorts the files on just the values in the first column (e.g. --key=1,1) and then picks the unique values in that column.

This operates on the assumption that the e-values are already sorted, lowest to highest for a given gene, as that's how BLAST output its results.

Without that assumption, I think you'll have to use a more powerful programming language (e.g. R) to convert the scientific number notation into a full-blown number that isn't interpreted as a string, in order to sort the e-values numerically.

EDITED: Clarified e-value sorting as string, instead of numerically.

kubu4 commented 5 years ago

Eh, ignore the post above. Looks like bash commands will recognize scientific notation!

sort \
--general-numeric-sort \
--key=1,1 \
--key=4,4 \
_blast-annot-geneOnly.tab \
| sort \
--unique \
--key=1,1

The --general-numeric-sort will recognize those e-values properly. The 2nd sorting step seems to be needed in order to get the --unique command to work properly on column 1.

EDITED: Used original file name used in this issue instead of custom filename I was using for the same data.

kubu4 commented 5 years ago

I'm not entirely convinced the solution above works properly. Do you have a larger data set that we can work with? Three lines isn't really sufficient for testing.

yaaminiv commented 5 years ago

@kubu4 Here's the full annotated file: https://github.com/eimd-2019/project-EWD-transcriptomics/blob/master/data/Zostera_SwissProt_e5_output_geneOnly.tab

kubu4 commented 5 years ago

That's not the same format as the file you want help with. Please post a longer version of that formatted file when you have the chance.

yaaminiv commented 5 years ago

I don't have access to the full joined file (yet), but the file I posted above has gene IDs, the associated isoforms, and e-values (second-to-last column)! Is there a way to apply similar commands to filtering down isoforms here? I think we just want to figure out how to do that and can troubleshoot similar commands with different files later.

kubu4 commented 5 years ago

Really, the question is does the e-value matter?

You indicated you want to look at the gene level. Although technically possible, I'd be very surprised if multiple isoforms for a given gene matched to different proteins (if that were the case, then it would be considered a different gene entirely). As such, the e-value doesn't matter.

So, for your current data set, I think what you really want is a list of all the unique gene IDs.

Alternatively, go back to the Trinity assembly and only BLAST the genes, not the isoforms. Then, you'll end up with a BLAST output without those pesky isoforms.

colleenburge commented 5 years ago

I agree that we want the unique gene IDs. I guess we were thinking to include the best e-value with the Gene ID but I suppose that it doesn't actually matter that much?

Our time is limited for the students to work with the dataset so I don't really want to redo the blast at this point. We both have enough to do right now :)

On Tue, Jul 9, 2019 at 3:23 PM kubu4 notifications@github.com wrote:

Really, the question is does the e-value matter?

You indicated you want to look at the gene level. Although technically possible, I'd be very surprised if multiple isoforms for a given gene matched to different proteins (if that were the case, then it would be considered a different gene entirely). As such, the e-value doesn't matter.

So, for your current data set, I think what you really want is a list of all the unique gene IDs.

Alternatively, go back to the Trinity assembly and only BLAST the genes, not the isoforms. Then, you'll end up with a BLAST output without those pesky isoforms.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/RobertsLab/resources/issues/708?email_source=notifications&email_token=AGJXQ2VINGGVO2TF7C2P47DP6UFVVA5CNFSM4H7I4TL2YY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGODZRWX5Q#issuecomment-509832182, or mute the thread https://github.com/notifications/unsubscribe-auth/AGJXQ2RPMAVMAUTDBQTGJEDP6UFVVANCNFSM4H7I4TLQ .

-- Colleen Burge Assistant Professor Institute of Marine and Environmental Technology University of Maryland Baltimore County Columbus Center 701 East Pratt Street Baltimore MD 21202 USA (410) 234-8834 (office) (410) 234-8835 (lab) Website: http://imet.umces.edu/cburge/

kubu4 commented 5 years ago

best e-value with the Gene ID but I suppose that it doesn't actually matter that much?

I don't think so, but I also don't know what the end goal of this is.

yaaminiv commented 5 years ago

The end goal is to have students perform a gene enrichment to see if there are any overrepresented biological processes in response to pathogen infection in eelgrass. If there's a low chance isoforms would map to different biological processes, then we can just have students look at unique gene IDs. If needed, @colleenburge and I can re-blast later.

sr320 commented 5 years ago

If there's a low chance isoforms would map to different biological processes,

yes very very low chance. I would not even expect them to blast to different proteins.

feel free to call me because I am still not clear on end goal.

sr320 commented 5 years ago

In my mind you should be using the transcriptome to find DEGs. use the correlated UNIPROT Accessions from the DEGs and see if they are overrepresented compared to UNIPROT Accessions from entire transcriptome. Not seeing how "isoforms" are causing an issue.

colleenburge commented 5 years ago

Yes, that's what we are doing but I don't want to work with the isoform data--I think the differential expression analysis is beyond what I have time to redo right now.

Plus, the project partners agreed to use the gene data and I'd like to continue with this analysis. Its unfortunate that the blast searches were done on the isoform data (by Amanda while she was still in the lab). I'm hoping we can proceed as we originally planned and get the annotations/GO analysis on just the gene data.

On Tue, Jul 9, 2019 at 7:16 PM Steven Roberts notifications@github.com wrote:

In my mind you should be using the transcriptome to find DEGs. use the correlated UNIPROT Accessions from the DEGs and see if they are overrepresented compared to UNIPROT Accessions from entire transcriptome. Not seeing how "isoforms" are causing an issue.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/RobertsLab/resources/issues/708?email_source=notifications&email_token=AGJXQ2QUD3JFDKE7RMWE4MDP6VBBDA5CNFSM4H7I4TL2YY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGODZSCOTQ#issuecomment-509880142, or mute the thread https://github.com/notifications/unsubscribe-auth/AGJXQ2X3LXX56NGYNMTJRWLP6VBBDANCNFSM4H7I4TLQ .

kubu4 commented 5 years ago

Out of curiosity, what is/was the workflow?

Trinity differential gene expression analysis (using edgeR and salmon) runs very quickly (we're talking 15 - 20 mins). This can be done solely at the gene level. Admittedly, the set up for running this takes a bit of time/effort.

Additionally, if you happened to have used Transdecoder and Trinotate (part of the Trinity trinity), you can re-run Trinotate to find enriched/depleted GO terms on your differentially expressed genes.

yaaminiv commented 5 years ago

@kubu4 Amanda used Trinity and we're using edgeR for differential expression analysis. Not clear if Transdecoder/Trinotate were used...@colleenburge may know.

This can be done solely at the gene level. Admittedly, the set up for running this takes a bit of time/effort.

What sort of code modifications do we need?

kubu4 commented 5 years ago

What sort of code modifications do we need?

I don't know. What code was used for all of this? The repo linked above is devoid of any of that type of info.

Regardless, the Trinity wiki has a pretty solid walk-through on how to run all of this.

colleenburge commented 5 years ago

Hi guys, I am to explain in more detail but there were many reasons we did not run the whole analysis in Trinity. We are using a specific model in Edge R to analyze the data.

Sending over code and files when I am done working with the students.

On Wed, Jul 10, 2019 at 10:15 AM kubu4 notifications@github.com wrote:

What sort of code modifications do we need?

I don't know. What code was used for all of this? The repo linked above is devoid of any of that type of info.

Regardless, the Trinity wiki has a pretty solid walk-through on how to run all of this.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/RobertsLab/resources/issues/708?email_source=notifications&email_token=AGJXQ2QG7RUA43EBDZPT2BTP6YKJXA5CNFSM4H7I4TL2YY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGODZUEMWI#issuecomment-510150233, or mute the thread https://github.com/notifications/unsubscribe-auth/AGJXQ2QMW6GEJYBEQRCZIGDP6YKJXANCNFSM4H7I4TLQ .

-- Colleen Burge Assistant Professor Institute of Marine and Environmental Technology University of Maryland Baltimore County Columbus Center 701 East Pratt Street Baltimore MD 21202 USA (410) 234-8834 (office) (410) 234-8835 (lab) Website: http://imet.umces.edu/cburge/

colleenburge commented 5 years ago

Here's the code and files:

https://umbc.box.com/s/gqznj69dcsfbolrnoygkvzh3s7nvxuqg

It's still uploading so might take a minute. You should be able to just click the link. Let me know if there's a problem.

Colleen:)

On Wed, Jul 10, 2019 at 10:37 AM Colleen Burge colleenb@umbc.edu wrote:

Hi guys, I am to explain in more detail but there were many reasons we did not run the whole analysis in Trinity. We are using a specific model in Edge R to analyze the data.

Sending over code and files when I am done working with the students.

On Wed, Jul 10, 2019 at 10:15 AM kubu4 notifications@github.com wrote:

What sort of code modifications do we need?

I don't know. What code was used for all of this? The repo linked above is devoid of any of that type of info.

Regardless, the Trinity wiki has a pretty solid walk-through on how to run all of this.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/RobertsLab/resources/issues/708?email_source=notifications&email_token=AGJXQ2QG7RUA43EBDZPT2BTP6YKJXA5CNFSM4H7I4TL2YY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGODZUEMWI#issuecomment-510150233, or mute the thread https://github.com/notifications/unsubscribe-auth/AGJXQ2QMW6GEJYBEQRCZIGDP6YKJXANCNFSM4H7I4TLQ .

-- Colleen Burge Assistant Professor Institute of Marine and Environmental Technology University of Maryland Baltimore County Columbus Center 701 East Pratt Street Baltimore MD 21202 USA (410) 234-8834 (office) (410) 234-8835 (lab) Website: http://imet.umces.edu/cburge/

-- Colleen Burge Assistant Professor Institute of Marine and Environmental Technology University of Maryland Baltimore County Columbus Center 701 East Pratt Street Baltimore MD 21202 USA (410) 234-8834 (office) (410) 234-8835 (lab) Website: http://imet.umces.edu/cburge/

kubu4 commented 5 years ago

Per the Zoom conversation, simplified code for eliminating duplicate gene IDs:

sort --unique --key=1,1 Zostera_SwissProt_e5_output_geneOnly.tab

yaaminiv commented 5 years ago

@kubu4 that worked!

files here, Jupyter notebook here

colleenburge commented 5 years ago

Thank you all!

On Wed, Jul 10, 2019 at 5:25 PM Yaamini Venkataraman < notifications@github.com> wrote:

@kubu4 https://github.com/kubu4 that worked!

files here https://github.com/eimd-2019/project-EWD-transcriptomics/tree/master/data, Jupyter notebook here https://github.com/eimd-2019/project-EWD-transcriptomics/blob/master/scripts/2019-07-10-blastx-Uniprot-File-Merging.ipynb

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/RobertsLab/resources/issues/708?email_source=notifications&email_token=AGJXQ2SGPNUZVCOW6D4UDNTP6Z4YHA5CNFSM4H7I4TL2YY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGODZVDXKI#issuecomment-510278569, or mute the thread https://github.com/notifications/unsubscribe-auth/AGJXQ2WLHJHWY3K3HBXQUWDP6Z4YHANCNFSM4H7I4TLQ .

-- Colleen Burge Assistant Professor Institute of Marine and Environmental Technology University of Maryland Baltimore County Columbus Center 701 East Pratt Street Baltimore MD 21202 USA (410) 234-8834 (office) (410) 234-8835 (lab) Website: http://imet.umces.edu/cburge/

RobertsLab / resources

Filter isoform data by e-value with bash #708