cbg-ethz / haploclique

Viral quasispecies assembly via maximal clique finding. A method to reconstruct viral haplotypes and detect large insertions and deletions from NGS data.
GNU General Public License v3.0
25 stars 33 forks source link

Errors when reading Bam files [Conda container of haploclique] #62

Closed jblamyatifremer closed 5 years ago

jblamyatifremer commented 5 years ago

Dear Haploclique creators,

I am able to install without any problem this conda https://anaconda.org/bioconda/haploclique. When i tested haploclique -h, I get the manual without any problem.

However, when I ran haploclique on Bam I get an error "Unexpected error while reading BamFile." line 296 of the cpp program...

I also rebuilt the conda env to be sure that no error was made. I get the same error. because i am working on a cluster, I have not chance to compile haploclique from the source.

I also check if bam was corrupted, I ran gatk ValidateSamFile in verbose, I only get warnings. I have no more idea... JB

jblamyatifremer commented 5 years ago

Re, I also tried to add afterwards bamtools in the conda env... no success. I am a beginner in conda stuff. JB

DrYak commented 5 years ago

Hum, that is weird. I haven't head heard of file loading problem since I've started on the team.

A few things that might help us debugging :

Thank you very much.

jblamyatifremer commented 5 years ago

Dear DrYak

In my first script was :

$ while IFS='' read -r line || [[ -n "$BAMFILE" ]]; do [bla... bla] $ haploclique -n -L "$OUTPUT""/""$BAMFILE"".log" "$BAMFILE" $ done < "$LIST_OF_BAM_FILE"

but using a much simple simple (with the bam file distributed in github to exlcude any bam corruption), that works like a charm, haploclique doesn't like shell variable (at least in my env) :

$ haploclique -n -L /ABSOLUTE/PATH/WITHOUT/SHELL/VARIABLE/INPUT.log /ABSOLUTE/PATH/WITHOUT/SHELL/VARIABLE/INPUT.bam

However haploclique does not want to proceed my bam files... I checked again with gatk ValidateSamFile, I only get 11 :

"WARNING: Record1, Read Name XXXXXXX, QUAL field is set to * (unspecified quality scores), this allowed by SAM specification but many tools expect reads to include quantities"

I doubts that is the problem. If I did not find the problem tonight, I will send you the bam file tomorrow morning (it is not possible from my personal computer).

I have not idea on how include the gdb in my pbs script... I tried various syntax... all the time i get the manual. I am progressing ;-)

Many thanks for your time.

JB

jblamyatifremer commented 5 years ago

Dammed... I test to convert the bam to sam... I get a wonderfull error :

[W::bam_hdr_read] EOF marker is absent. The input is probably truncated. [E::bgzf_read] bgzf_read_block error -1 after 0 of 4 bytes [main_samview] truncated file.

All the remaining bug is from my side not Haploclique... it close this issue. Sorry for annoyance.

DrYak commented 5 years ago

Okay, glad to hear that you managed to find a solution to your problems.

As a small curiosity, what for (which project) and where (which uni) are you using haploclique ?

A few extra tips :


haploclique doesn't like shell variable (at least in my env) :

Regarding your shell variables : Linux (and most other Unices, such as Mac OS X) is a bit different than Windows. Program themselves don't handle shell variables or pattern (*.bam) matching (unless specifically programmed for).

It's the role of the shell itself (bash most of the time on the command line. And whatever you put after the shebang (#!) at the top of your script) to expand the shell variable into full strings and match any pattern into multiple arguments, before passing the whole parameters list to the program.

haploclique -n -L "$OUTPUT""/""$BAMFILE"".log" "$BAMFILE"

That might be a bit confusing to the shell, due to multiple quote marks. (The behaviour of strings of adjacent quotes stringed togeter -- "like"'this' -- isn't clearly defined, and depending on the shell and versions thereof might not behave the way you would expect. I'm almost sure that this can confuse the simpler syntax used by /bin/sh )

I would suggest instead :

#!/bin/bash
haploclique -n -L "${OUTPUT}/${BAMFILE}.log" "${BAMFILE}"

This uses:


with the bam file distributed in github to exlcude any bam corruption

I would also suggest having a look at rsync instead of scp as another way to copy data with guarantees against corruption (it uses checksumming internally). It's even better, given that git is normally optimized for text (source code) whereas rsync is used for binaries. Example :

rsync -avP --inplace my.bam remoteserver:download/

(a: (archive) copies everything including attributes, permissions, etc. , v (verbose) and P (progress bar) gives you a visual output as things are progressing, --inplace don't create separate temp-file but tries to actually continue any present file - if the file is closely similar to the one you're sending, only the bits which are different will be sent. You can also use --checksum to force to control an already-finished file against corruption, instead of just looking at file size and time stamp).

Yet another way to check would be to use checksums (md5sum for speed, sha256 for security).


Regarding gdb if you want to do it as part of a shell script :

gdb -ex "run '-n' '-L' '${OUTPUT}/${BAMFILE}.log' '${BAMFILE}'" -ex "backtrace" -ex "quit" haploclique"

(or write your gdb commands (run, backtrace, quit) in a script file and pass it to gdb with gdb -x debugscript)

But, it's always simpler to log into an interactive ssh server on your cluster (if you have such dedicated interractive nodes), or ask the submitting system for an interactive shell on the regular nodes (something along the lines of bsub -ISp, qrsh, srun --pty bash -i etc. depending on the the submission system used at your cluster. Refer to the training material of the introductory training course you underwent before accessing your cluster). And then run gdb from the command line.

jblamyatifremer commented 5 years ago

Dear DrYak,

Thanks you for pedantic and usefull explanation about bash variables manipulation and tips...

I am working on Herpesvirus (double strained DNA, 200kbp) that infect bivalve (cause a lot of economic loss). Most of the litterature on the topic only conceive that only one haplotype infects one individual... But I am testing this hypothesis... I have used QuasiRecomb but Oshv1 have great deletions compare to reference and i was obliged to use QuasiRecomb on small DNA portion without great deletion. I am hopping that haploclique will solve most problems.

I am working at Ifremer [5] (at french research institut of oceans) as junior researcher, i am following all your tools because we have similar question on very different host-pathogenes system.

Cheers,

JB

On 2018-10-09 11:00, DrYak wrote:

Okay, glad to hear that you managed to find a solution to your problems.

As a small curiosity, what for (which project) and where (which uni) are you using haploclique ?

A few extra tips :

haploclique doesn't like shell variable (at least in my env) :

Regarding your shell variables : Linux (and most other Unices, such as Mac OS X) is a bit different than Windows. Program themselves don't handle shell variables or pattern (*.bam) matching (unless specifically programmed for).

It's the role of the shell itself (bash most of the time on the command line. And whatever you put after the shebang [1] (#!) at the top of your script) to expand the shell variable into full strings and match any pattern into multiple arguments, before passing the whole parameters list to the program.

haploclique -n -L "$OUTPUT""/""$BAMFILE"".log" "$BAMFILE"

That might be a bit confusing to the shell, due to multiple quote marks. (The behaviour of strings of adjacent quotes stringed togeter -- "like"'this' -- isn't clearly defined, and depending on the shell and versions thereof might not behave the way you would expect. I'm almost sure that this can confuse the simpler syntax used by /bin/sh )

I would suggest instead :

!/bin/bash

haploclique -n -L "${OUTPUT}/${BAMFILE}.log" "${BAMFILE}"

This uses:

  • full-blown bash, instead of either dash (Ubuntu) or bash (most others) with simplified sh-like syntax.
  • only one quote pair per argument passed to haploclique, to that the shell clearly that there are only 4 arguments (-n, -L, the logfile and the bamfile), no risk of splitting around multiple stringed-together quotes.
  • curly braces [2] (instead of multiple quote) are used to mark the variable names (as opposed to other text that must be passed as-is) instead of quotes.

with the bam file distributed in github to exlcude any bam corruption

I would also suggest having a look at rsync instead of scp as another way to copy data with guarantees against corruption (it uses checksumming internally). It's even better, given that git is normally optimized for text (source code) whereas rsync is used for binaries. Example :

rsync -avP --inplace my.bam remoteserver:download/

(a: (archive) copies everything including attributes, permissions, etc. , v (verbose) and P (progress bar) gives you a visual output as things are progressing, --inplace don't create separate temp-file but tries to actually continue any present file - if the file is closely similar to the one you're sending, only the bits which are different will be sent. You can also use --checksum to force to control an already-finished file against corruption, instead of just looking at file size and time stamp).

Yet another way to check would be to use checksums (md5sum for speed, sha256 for security).

  • use md5sum file.bam on both sides to display checksums.
  • or md5sum file.bam | tee MD5SUMS on source to generate a list
  • and md5sum -C MD5SUMS on the destination to compare with the list (the checksum program will tell you OK or FAILED)

Regarding gdb if you want to do it as part of a shell script :

gdb -ex "run '-n' '-L' '${OUTPUT}/${BAMFILE}.log' '${BAMFILE}'" -ex "backtrace" -ex "quit" haploclique"

(or write your gdb commands (run, backtrace, quit) in a script file and pass it to gdb with gdb -x debugscript)

But, it's always simpler to log into an interactive ssh server on your cluster (if you have such dedicated interractive nodes), or ask the submitting system for an interactive shell on the regular nodes (something along the lines of bsub -ISp, qrsh, srun --pty bash -i etc. depending on the the submission system used at your cluster. Refer to the training material of the introductory training course you underwent before accessing your cluster). And then run gdb from the command line.

-- You are receiving this because you modified the open/close state. Reply to this email directly, view it on GitHub [3], or mute the thread [4].

Links:

[1] https://askubuntu.com/a/141932/295320 [2] https://www.tldp.org/LDP/abs/html/parameter-substitution.html [3] https://github.com/cbg-ethz/haploclique/issues/62#issuecomment-428116070 [4] https://github.com/notifications/unsubscribe-auth/AMn3GiYGMmVrxEiAqpNxOKicySQ6nbDVks5ujGW_gaJpZM4XMDmC [5] https://wwz.ifremer.fr/en/

SoapZA commented 5 years ago

@DrYak please keep your replies concise and to the point. This issue should not turn into a big discussion.

DrYak commented 5 years ago

@jblamyatifremer could you contact me on ivan.topolsky@bsse.ethz.ch for further discussion regarding your use case ?

armintoepfer commented 5 years ago

You know how to find out, if your BAM is corrupt?

samtools view -H input.bam

This will tell you EOF marker is absent if corrupt. Pretty easy, right :)

jblamyatifremer commented 5 years ago

Finally I found the problem ! I post the solution for newbies as me, i am struggling with that since the beginning of the week. Some reads in the input bam did not have an integer as QUAL but " * " which allowed by the standard (but not a good practice, at least, in viral genomics). Everything is working fine. happiness !