NCBI-Hackathons / GeneFamTaxScan

A framework and family of scripts to evaluate molecular evolution (and misannotation) of gene ortholog groups, between higher taxa.
MIT License
0 stars 1 forks source link

BLAST search contig/assembly-level genome sequence from a taxonomic subset #2

Open PhyloGrok opened 6 years ago

PhyloGrok commented 6 years ago

This issue segment works make a BLAST db from taxonomy-delimited esearch, nblast against the homolog2fasta.sh retreived query sequences to identify genomic loci matching homologenes.

PhyloGrok commented 6 years ago

$ wget esearch -db genome -query txid10088[Organism:exp] | elink -target assembly | esummary | xtract -pattern DocumentSummary -element FtpPath_RefSeq | awk -F"/" '{print $0"/"$NF"_genomic.fna.gz"}'

PhyloGrok commented 6 years ago

A solution may be to have the reference list BLAST against a comprehensive NCBI Assembly BLAST database, then filter the results by taxa? -it would save having to make a local, taxa-specific BLAST db.

DCGenomics commented 6 years ago

We could compare both, but it seems the trimmed blastdb would be much faster.

Cheers!

Ben

On Tue, Nov 28, 2017 at 4:06 PM, Robinson, JM notifications@github.com wrote:

A solution may be to have the reference list BLAST against a comprehensive NCBI Assembly BLAST database, then filter the results by taxa? -it would save having to make a local, taxa-specific BLAST db.

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/NCBI-Hackathons/GeneFamTaxScan/issues/2#issuecomment-347663581, or mute the thread https://github.com/notifications/unsubscribe-auth/AFePta8EjOCwqCeOlkJqWf43H9wV87DBks5s7HXwgaJpZM4Qtxu9 .

-- What have you done today to make the world a better place?

PhyloGrok commented 6 years ago
$ cat GCF_000001635.26_GRCm38.p6_genomic.fna.gz  GCF_900094665.1_CAROLI_EIJ_v1.1_genomic.fna.gz  GCF_900095145.1_PAHARI_EIJ_v1.1_genomic.fna.gz > lib_mouse.fna.gz
$ gunzip lib_mouse.fna.gz lib_mouse.fna

$ makeblastdb -dbtype nucl -in lib_mouse.fna
$ blastn -db lib_mouse.fna -query drosha.fasta -evalue 1e-20 -outfmt "6 scomnames sseqid length bitscore"

BLAST output needs work.. lots of really good, smallish (100-200bp) BLAST hits covering the query gene. How to go from this to say 'yes/no' the gene is present or 'not misassembled'? RefSeq protein search will probably be a better answer for whether annotation has 'found' a gene of interest.

PhyloGrok commented 6 years ago

now can go from homologeneID to blastn on assemblies in a selected taxonomic range. Putting together the BLAST calls into a script now. What is the best visual output format/formatting for BLAST results?

DCGenomics commented 6 years ago

I would use outfmt 6 for now

On Wed, Nov 29, 2017 at 2:58 PM, Robinson, JM notifications@github.com wrote:

now can go from homologeneID to blastn on assemblies in a selected taxonomic range. Putting together a script for BLAST now. What is the best visual output formatter for BLAST results?

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/NCBI-Hackathons/GeneFamTaxScan/issues/2#issuecomment-347978081, or mute the thread https://github.com/notifications/unsubscribe-auth/AFePtdBItm3e1ml7xbPsIQ1pSFamZok1ks5s7bdggaJpZM4Qtxu9 .

-- What have you done today to make the world a better place?

PhyloGrok commented 6 years ago

BLASTs are taking like 10-15 min, the blastDB is only 3 genomes large and query is only 8 sequences. What to do with libraries of high level taxonomy, such as Insects or Protostomes, that have 100s of genomes? Is that pretty common?

blastn -db lib_mouse.fna -query drosha.fasta -num_threads 20 -outfmt "6 scomnames sseqid length bitscore" > Mouse1

Add '-num_threads' to reduce time.

PhyloGrok commented 6 years ago

BLAST Output returns many contiguous matches -

NC_000081.6:12824803-12935290   110488  N/A     NC_034584.1     14583   90744   105108  0.0     20572
NC_000081.6:12824803-12935290   110488  N/A     NC_034584.1     13590   17932   31317   0.0     19221
NC_000081.6:12824803-12935290   110488  N/A     NC_034584.1     5296    84587   89824   0.0     7912
...etc

Sort to obtain a 'path':

$ sort -k1,1 -k6,6n   Mouse > MouseSort

Returns a ordered list of hits, but many of virtually the same hit ie.:

NC_000081.6:12824803-12935290   110488  N/A     NC_034581.1     356     5400    5754    9.48e-145       527
NC_000081.6:12824803-12935290   110488  N/A     NC_034605.1     356     5400    5754    5.74e-137       501
NC_000081.6:12824803-12935290   110488  N/A     NC_034613.1     358     5400    5755    3.48e-129       475
NC_000081.6:12824803-12935290   110488  N/A     NC_034574.1     321     5401    5721    4.57e-118       438
NC_000081.6:12824803-12935290   110488  N/A     NC_034574.1     354     5401    5754    9.68e-130       477

how to remove the superfluous hits? Work with BLAST parameters, ie:

blastn -db lib_mouse.fna -query drosha.fasta -qcov_hsp_perc 20 -evalue 1e-15 -num_threads 20  > Mouse1
DCGenomics commented 6 years ago

Multithreading

On Wed, Nov 29, 2017 at 3:52 PM, Robinson, JM notifications@github.com wrote:

BLASTs are taking like 10-15 min, the blastDB is only 3 genomes large and query is only 8 sequences. What to do with libraries of high level taxonomy, such as Insects or Protostomes, that have 100s of genomes?

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/NCBI-Hackathons/GeneFamTaxScan/issues/2#issuecomment-347992430, or mute the thread https://github.com/notifications/unsubscribe-auth/AFePtdOqJ78_aLWNbYUsrALz5sUsgOXmks5s7cP6gaJpZM4Qtxu9 .

-- What have you done today to make the world a better place?

DCGenomics commented 6 years ago

Tao Tao may have some suggestions. If not, we'll talk to Tom Madden (head of BLAST group)

On Thu, Nov 30, 2017 at 11:11 AM, Robinson, JM notifications@github.com wrote:

BLAST Output returns many contiguous matches -

NC_000081.6:12824803-12935290 110488 N/A NC_034584.1 14583 90744 105108 0.0 20572 NC_000081.6:12824803-12935290 110488 N/A NC_034584.1 13590 17932 31317 0.0 19221 NC_000081.6:12824803-12935290 110488 N/A NC_034584.1 5296 84587 89824 0.0 7912 ...etc

Sort to obtain a 'path':

$ sort -k1,1 -k6,6n Mouse > MouseSort

Returns a ordered list of hits, but many of virtually the same hit ie.:

NC_000081.6:12824803-12935290 110488 N/A NC_034581.1 356 5400 5754 9.48e-145 527 NC_000081.6:12824803-12935290 110488 N/A NC_034605.1 356 5400 5754 5.74e-137 501 NC_000081.6:12824803-12935290 110488 N/A NC_034613.1 358 5400 5755 3.48e-129 475 NC_000081.6:12824803-12935290 110488 N/A NC_034574.1 321 5401 5721 4.57e-118 438 NC_000081.6:12824803-12935290 110488 N/A NC_034574.1 354 5401 5754 9.68e-130 477

how to remove the superfluous hits?

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/NCBI-Hackathons/GeneFamTaxScan/issues/2#issuecomment-348236209, or mute the thread https://github.com/notifications/unsubscribe-auth/AFePtbGwSqpZUv0xQ4pFgNH0lePeSuCHks5s7tOUgaJpZM4Qtxu9 .

-- What have you done today to make the world a better place?

DCGenomics commented 6 years ago

Hi,

BLAST does not know the biological “spurious” nature of the matches it found in a search. Given a large chunk of genomic, particularly in intronic region, you have higher likeliness of including repeats, which will find matches to similar repeats present in other area.

Without the details of the search, I am not sure what specific measures to take. There are a few general aspects you may want to consider:

Hope this helps. If you have more questions, please write back with details on the query, the target database, and the exact commandline used.

Tao

From: Ben Busby [mailto:ben.busby@gmail.com] Sent: Thursday, November 30, 2017 12:53 PM To: NCBI-Hackathons/GeneFamTaxScan reply@reply.github.com; Tao, Tao (NIH/NLM/NCBI) [E] tao@ncbi.nlm.nih.gov Subject: Re: [NCBI-Hackathons/GeneFamTaxScan] BLAST search contig/assembly-level genome sequence from a taxonomic subset (#2)

Tao Tao may have some suggestions. If not, we'll talk to Tom Madden (head of BLAST group)

On Thu, Nov 30, 2017 at 11:11 AM, Robinson, JM notifications@github.com<mailto:notifications@github.com> wrote:

BLAST Output returns many contiguous matches -

NC_000081.6:12824803-12935290 110488 N/A NC_034584.1 14583 90744 105108 0.0 20572

NC_000081.6:12824803-12935290 110488 N/A NC_034584.1 13590 17932 31317 0.0 19221

NC_000081.6:12824803-12935290 110488 N/A NC_034584.1 5296 84587 89824 0.0 7912

...etc

Sort to obtain a 'path':

$ sort -k1,1 -k6,6n Mouse > MouseSort

Returns a ordered list of hits, but many of virtually the same hit ie.:

NC_000081.6:12824803-12935290 110488 N/A NC_034581.1 356 5400 5754 9.48e-145 527

NC_000081.6:12824803-12935290 110488 N/A NC_034605.1 356 5400 5754 5.74e-137 501

NC_000081.6:12824803-12935290 110488 N/A NC_034613.1 358 5400 5755 3.48e-129 475

NC_000081.6:12824803-12935290 110488 N/A NC_034574.1 321 5401 5721 4.57e-118 438

NC_000081.6:12824803-12935290 110488 N/A NC_034574.1 354 5401 5754 9.68e-130 477

how to remove the superfluous hits?

— You are receiving this because you commented. Reply to this email directly, view it on GitHubhttps://github.com/NCBI-Hackathons/GeneFamTaxScan/issues/2#issuecomment-348236209, or mute the threadhttps://github.com/notifications/unsubscribe-auth/AFePtbGwSqpZUv0xQ4pFgNH0lePeSuCHks5s7tOUgaJpZM4Qtxu9.

-- What have you done today to make the world a better place?

DCGenomics commented 6 years ago

I dont think the hits are spurious, just somewhat duplicative.

If you were to use the ~"-max_results" (sp) flag,could you reduce these to the best (longest within an expect value range)?

On Thu, Nov 30, 2017 at 1:44 PM, Tao, Tao (NIH/NLM/NCBI) [E] < tao@ncbi.nlm.nih.gov> wrote:

Hi,

BLAST does not know the biological “spurious” nature of the matches it found in a search. Given a large chunk of genomic, particularly in intronic region, you have higher likeliness of including repeats, which will find matches to similar repeats present in other area.

Without the details of the search, I am not sure what specific measures to take. There are a few general aspects you may want to consider:

  • Start with cleaner, more stringent search settings (-dust and others under Query filtering …, and restrict search …. Section)
  • Or prep masking the query with lower case and combine with -lcase_masking
  • Asking results that have higher percent identity, and lower expect

Hope this helps. If you have more questions, please write back with details on the query, the target database, and the exact commandline used.

Tao

From: Ben Busby [mailto:ben.busby@gmail.com] Sent: Thursday, November 30, 2017 12:53 PM To: NCBI-Hackathons/GeneFamTaxScan <reply+00578fb5192c42851c12a73b956b40 bd2cec6d772cf0e58392cf000000011637ef9492a169ce108a6cbf@reply.github.com>; Tao, Tao (NIH/NLM/NCBI) [E] tao@ncbi.nlm.nih.gov Subject: Re: [NCBI-Hackathons/GeneFamTaxScan] BLAST search contig/assembly-level genome sequence from a taxonomic subset (#2)

Tao Tao may have some suggestions. If not, we'll talk to Tom Madden (head of BLAST group)

On Thu, Nov 30, 2017 at 11:11 AM, Robinson, JM notifications@github.com wrote:

BLAST Output returns many contiguous matches -

NC_000081.6:12824803-12935290 110488 N/A NC_034584.1 14583 90744 105108 0.0 20572

NC_000081.6:12824803-12935290 110488 N/A NC_034584.1 13590 17932 31317 0.0 19221

NC_000081.6:12824803-12935290 110488 N/A NC_034584.1 5296 84587 89824 0.0 7912

...etc

Sort to obtain a 'path':

$ sort -k1,1 -k6,6n Mouse > MouseSort

Returns a ordered list of hits, but many of virtually the same hit ie.:

NC_000081.6:12824803-12935290 110488 N/A NC_034581.1 356 5400 5754 9.48e-145 527

NC_000081.6:12824803-12935290 110488 N/A NC_034605.1 356 5400 5754 5.74e-137 501

NC_000081.6:12824803-12935290 110488 N/A NC_034613.1 358 5400 5755 3.48e-129 475

NC_000081.6:12824803-12935290 110488 N/A NC_034574.1 321 5401 5721 4.57e-118 438

NC_000081.6:12824803-12935290 110488 N/A NC_034574.1 354 5401 5754 9.68e-130 477

how to remove the superfluous hits?

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/NCBI-Hackathons/GeneFamTaxScan/issues/2#issuecomment-348236209, or mute the thread https://github.com/notifications/unsubscribe-auth/AFePtbGwSqpZUv0xQ4pFgNH0lePeSuCHks5s7tOUgaJpZM4Qtxu9 .

--

What have you done today to make the world a better place?

-- What have you done today to make the world a better place?

DCGenomics commented 6 years ago

Ben,

You need help us help you, specifically by providing the details and thus the context, so that we can take a look more direct.

Near identical hits are very likely from repeats, if you want to play around with the setting, you can try -max_hsps and -max_target_seqs , but that may have other side-effect.

Tao

From: Ben Busby [mailto:ben.busby@gmail.com] Sent: Thursday, November 30, 2017 1:51 PM To: Tao, Tao (NIH/NLM/NCBI) [E] tao@ncbi.nlm.nih.gov Cc: NCBI-Hackathons/GeneFamTaxScan reply@reply.github.com Subject: Re: [NCBI-Hackathons/GeneFamTaxScan] BLAST search contig/assembly-level genome sequence from a taxonomic subset (#2)

I dont think the hits are spurious, just somewhat duplicative.

If you were to use the ~"-max_results" (sp) flag,could you reduce these to the best (longest within an expect value range)?

On Thu, Nov 30, 2017 at 1:44 PM, Tao, Tao (NIH/NLM/NCBI) [E] tao@ncbi.nlm.nih.gov<mailto:tao@ncbi.nlm.nih.gov> wrote: Hi,

BLAST does not know the biological “spurious” nature of the matches it found in a search. Given a large chunk of genomic, particularly in intronic region, you have higher likeliness of including repeats, which will find matches to similar repeats present in other area.

Without the details of the search, I am not sure what specific measures to take. There are a few general aspects you may want to consider:

Hope this helps. If you have more questions, please write back with details on the query, the target database, and the exact commandline used.

Tao

From: Ben Busby [mailto:ben.busby@gmail.commailto:ben.busby@gmail.com] Sent: Thursday, November 30, 2017 12:53 PM To: NCBI-Hackathons/GeneFamTaxScan reply@reply.github.com<mailto:reply%2B00578fb5192c42851c12a73b956b40bd2cec6d772cf0e58392cf000000011637ef9492a169ce108a6cbf@reply.github.com>; Tao, Tao (NIH/NLM/NCBI) [E] tao@ncbi.nlm.nih.gov<mailto:tao@ncbi.nlm.nih.gov> Subject: Re: [NCBI-Hackathons/GeneFamTaxScan] BLAST search contig/assembly-level genome sequence from a taxonomic subset (#2)

Tao Tao may have some suggestions. If not, we'll talk to Tom Madden (head of BLAST group)

On Thu, Nov 30, 2017 at 11:11 AM, Robinson, JM notifications@github.com<mailto:notifications@github.com> wrote:

BLAST Output returns many contiguous matches -

NC_000081.6:12824803-12935290 110488 N/A NC_034584.1 14583 90744 105108 0.0 20572

NC_000081.6:12824803-12935290 110488 N/A NC_034584.1 13590 17932 31317 0.0 19221

NC_000081.6:12824803-12935290 110488 N/A NC_034584.1 5296 84587 89824 0.0 7912

...etc

Sort to obtain a 'path':

$ sort -k1,1 -k6,6n Mouse > MouseSort

Returns a ordered list of hits, but many of virtually the same hit ie.:

NC_000081.6:12824803-12935290 110488 N/A NC_034581.1 356 5400 5754 9.48e-145 527

NC_000081.6:12824803-12935290 110488 N/A NC_034605.1 356 5400 5754 5.74e-137 501

NC_000081.6:12824803-12935290 110488 N/A NC_034613.1 358 5400 5755 3.48e-129 475

NC_000081.6:12824803-12935290 110488 N/A NC_034574.1 321 5401 5721 4.57e-118 438

NC_000081.6:12824803-12935290 110488 N/A NC_034574.1 354 5401 5754 9.68e-130 477

how to remove the superfluous hits?

— You are receiving this because you commented. Reply to this email directly, view it on GitHubhttps://github.com/NCBI-Hackathons/GeneFamTaxScan/issues/2#issuecomment-348236209, or mute the threadhttps://github.com/notifications/unsubscribe-auth/AFePtbGwSqpZUv0xQ4pFgNH0lePeSuCHks5s7tOUgaJpZM4Qtxu9.

-- What have you done today to make the world a better place?

-- What have you done today to make the world a better place?