baoe / AlignGraph

Algorithm for secondary de novo genome assembly guided by closely related references

BLAT/PBLAT issue "Maximum single piece size (5000) exceeded" #25

Open Schum1 opened 8 years ago

Schum1 commented 8 years ago

Hello Bao, I have assembled a de novo genome and would like to align it to the reference genome of a closely related species using AlignGraph. I run AlignGraph with the following command:

/home/bin/AlignGraph/AlignGraph/AlignGraph --read1 ../Start_fasta/Start_RawReads_FD.fasta --read2 ../Start_fasta/Start_RawReads_RD.fasta --contig ../../1_Short_Read_Assembly/MaSuRCA_1/CA/10-gapclose/genome.ctg.fasta --genome ../../../reference/assembly/ref_281_v5.0.softmasked_GCM.fa --distanceLow 100 --distanceHigh 1350 --extendedContig AlignGraph_1_extendedContigs.fa --remainingContig AlignGraph_1_remainingContigs.fa

This is a small summary of the input reads/genomes and their length distribution (AlignGraph_Issue.xlsx).

Everything runs fine until blat/pblat (I tested both) reports the following error in blat_doc.txt:

Maximum single piece size (5000) exceeded by query 1.1 of size (49814). Larger pieces will have to be split up until no larger than this limit when the -fastMap option is used.

I took the liberty of adding some debug lines to AlignGraph.cpp, so I know that this happens around line 3654 (AlignGraph.cpp) in the

"void * task1(void * arg)"

when

"command = "/home/bin/icebert-pblat-ed0ac17/pblat tmp/_genome." + itoa(chromosomeID) + ".fa tmp/_contigs.fa -noHead tmp/_contigs_genome." + itoa(chromosomeID) + ".psl -fastMap -threads=8 > blat_doc.txt 2> blat_doc.txt";"

is called.

As I understand it, BLAT/PBLAT fails while aligning the de novo contigs against the reference genome: some of the contigs are longer than 5000 bp, and blat/pblat requires queries of at most 5000 bp when the -fastMap flag (which suppresses gaps) is used. Did I get that right?

Is splitting my de novo contigs into acceptable sizes the only option, or is there a workaround? I would like to keep the longer contigs if possible. Otherwise I would just split every contig longer than 5000 bp into separate FASTA entries.
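The fallback mentioned above, splitting every contig longer than 5000 bp into separate FASTA entries, could be sketched roughly as follows. This is only an illustration: the function names and the `_partN` naming scheme are hypothetical, and the 5000 limit is taken from the error message.

```python
from typing import Iterable, Iterator, Tuple

MAX_PIECE = 5000  # limit quoted in the BLAT error message for -fastMap


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new record starts: emit the previous one, if any.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def split_contigs(records, max_len: int = MAX_PIECE):
    """Yield (header, sequence) pieces no longer than max_len.

    Contigs that already fit are passed through unchanged; longer ones are
    cut into consecutive pieces named <id>_part<N> (a hypothetical scheme).
    """
    for header, seq in records:
        if len(seq) <= max_len:
            yield header, seq
            continue
        name = header.split()[0]  # drop any description after the ID
        for i, start in enumerate(range(0, len(seq), max_len), 1):
            yield f"{name}_part{i}", seq[start:start + max_len]
```

Writing each yielded pair back as a `>header` line followed by the sequence gives a FASTA file that stays within the limit. The cost is that the long contigs are no longer contiguous, which is exactly what the question hopes to avoid.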

Best regards, Ale R.

baoe commented 8 years ago

Hi, Ale,

Thank you for your interest in AlignGraph! An earlier version of BLAT that can process longer contigs is available from https://users.soe.ucsc.edu/~kent/src/. See FAQ4 for details.

Best, Bao



Schum1 commented 8 years ago

Hi Bao, thank you very much for your quick response. Because I prefer to use multithreaded pblat, I used the following approach:

AlignGraph.cpp calls pblat/blat, which in turn takes the maximum query length (5000) from genoFind.h. I changed the following line in genoFind.h:

/icebert-pblat-ed0ac17/inc/genoFind.h (line 380)

#define MAXSINGLEPIECESIZE 5000 /* maximum size of a single piece */

and changed it to:

#define MAXSINGLEPIECESIZE 1000000 /* maximum size of a single piece */

(1000000 is just an arbitrary number.)

I recompiled pblat and AlignGraph. It runs just fine :)
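For anyone who wants to script the same edit, a minimal sketch is shown below. It assumes the header contains exactly one `MAXSINGLEPIECESIZE` define, as in the line quoted above; the function name `raise_piece_limit` is hypothetical, and pblat still has to be recompiled afterwards.

```python
import re

# Matches the macro name and captures everything up to the numeric value,
# so any trailing comment on the line is left untouched.
_DEFINE_RE = re.compile(
    r"^(\s*#\s*define\s+MAXSINGLEPIECESIZE\s+)\d+", re.MULTILINE
)


def raise_piece_limit(header_text: str, new_limit: int) -> str:
    """Return header_text with MAXSINGLEPIECESIZE set to new_limit."""
    patched, count = _DEFINE_RE.subn(
        lambda m: f"{m.group(1)}{new_limit}", header_text
    )
    if count != 1:
        # Refuse to guess if the define is missing or appears several times.
        raise ValueError(
            f"expected exactly one MAXSINGLEPIECESIZE define, found {count}"
        )
    return patched
```

The function works on the file contents as a string, so the caller reads genoFind.h, passes the text in, and writes the result back before rebuilding.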

Best, Ale

baoe commented 8 years ago

Thank you so much for this tip! It will be very helpful for other users!

Best, Bao



kzukowski commented 7 years ago

thx!!

ferrolad commented 1 year ago

Remove "-fastMap" in pblat command.
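In AlignGraph itself this means deleting `-fastMap` from the command string built in AlignGraph.cpp and recompiling. Purely as an illustration, the sketch below builds the same pblat invocation in Python with the flag made optional. The function name and parameters are hypothetical; only the arguments quoted earlier in this thread (`-noHead`, `-fastMap`, `-threads=8`) are used.

```python
def pblat_argv(pblat, database, query, out_psl, threads=8, fast_map=False):
    """Build a pblat argument list mirroring the command quoted in this thread.

    With fast_map=False the -fastMap flag is left out; the error message
    ties the 5000-base query limit to that flag.
    """
    argv = [pblat, database, query, "-noHead", out_psl]
    if fast_map:
        argv.append("-fastMap")
    argv.append(f"-threads={threads}")
    return argv
```

For example, `pblat_argv("pblat", "tmp/_genome.1.fa", "tmp/_contigs.fa", "tmp/_contigs_genome.1.psl")` yields an argument list without `-fastMap` that could be passed to `subprocess.run`. Alignments produced without the flag may differ from the `-fastMap` results, since that flag is what suppresses gaps.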

sqwwww commented 5 months ago

Remove "-fastMap" in pblat command.

thanks!