DRL / blobtools-light

Light version blobtools package - NO LONGER MAINTAINED! DO NOT USE!

add support for SOAP assembly header format? #4

Open jayoung opened 8 years ago

jayoung commented 8 years ago

Hi there,

I'm beginning to use blobtools: it's really nice. Thanks for your work!

I have some SOAP assemblies I'm running this on. The fasta headers look like this:

scaffold8 11.7
scaffold16 21.8

For now I'm using the corresponding bam files to get the coverage, and that's fine, although I imagine one of the SOAP output files probably has that information. I am wondering whether the second number in the SOAP sequence headers is coverage, but I can't see it described in the SOAP documentation, and it doesn't seem to match up very well with what blobtools outputs using my bam files (although I can imagine various reasons why that might be true in my case).

Anyway, if I run makeblobs.py on my assembly file as is, I get an error: "[ERROR] - Sequence header scaffold3300 in myBam.bam does not seem to be part of the assembly. FASTA header of sequence in assembly MUST equal reference sequence name in BAM file. Please check your input files." (I got that using the -a option to specify the assembly. I got other errors when I tried pretending it's a spades/velvet/abyss assembly; I guess the fasta headers of those look different again.)

I can get around that error if I use a version of the assembly where I change the headers to look like this (I strip off anything after the first space):

scaffold8
scaffold16

So it looks like makeblobs.py is considering the entire header line as the sequence ID including the description field(s), rather than just the first word, which is the sequence ID.
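For anyone who wants to reproduce the workaround, here is a minimal Python sketch of the header stripping I described. It is my own illustration, not part of blobtools: the function name strip_fasta_descriptions and the short sequences in the example are made up, and it assumes a plain FASTA file where header lines start with ">".

```python
import io


def strip_fasta_descriptions(src, dst):
    """Copy FASTA text from src to dst, keeping only the first word of each header.

    Hypothetical helper for illustration; not a blobtools function.
    """
    for line in src:
        if line.startswith(">"):
            # Split on the first run of whitespace; the first field is the sequence ID.
            fields = line[1:].split(None, 1)
            seq_id = fields[0] if fields else ""
            dst.write(">" + seq_id + "\n")
        else:
            # Sequence lines are copied unchanged.
            dst.write(line)


# Made-up example data in the style of the SOAP headers shown above.
example = io.StringIO(">scaffold8 11.7\nACGT\n>scaffold16 21.8\nGGCC\n")
cleaned = io.StringIO()
strip_fasta_descriptions(example, cleaned)
result = cleaned.getvalue()
```

With real files, the same function can be called on two open file handles (one for reading the original assembly, one for writing the copy).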

It'd be great if there was an option to have makeblobs.py strip off the description field from the assembly sequence headers internally, so I could run it on the SOAP assemblies without having to strip off the description field and make a new copy of the assembly file.
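To make the request concrete, below is a rough sketch of what such an option could look like inside a FASTA reader. I have not looked at how makeblobs.py parses the assembly, so this is only an assumption about the general approach: read_fasta and the strip_description flag are hypothetical names, and the sequences are invented.

```python
import io


def read_fasta(handle, strip_description=True):
    """Yield (name, sequence) pairs from FASTA text.

    Hypothetical sketch, not makeblobs.py code. When strip_description is
    True, the name is only the first whitespace-delimited word of the
    header, which is what ends up as the reference name in a BAM file.
    """
    name, chunks = None, []
    for line in handle:
        line = line.rstrip()
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            header = line[1:]
            if strip_description and header.strip():
                name = header.split(None, 1)[0]
            else:
                name = header
            chunks = []
        elif line and name is not None:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


# Made-up example data in the style of the SOAP headers.
text = ">scaffold8 11.7\nACGT\nTTGA\n>scaffold16 21.8\nGGCC\n"
stripped = dict(read_fasta(io.StringIO(text)))
unstripped = dict(read_fasta(io.StringIO(text), strip_description=False))
```

Here stripped is keyed by the bare IDs, which would match the names in the bam file, while unstripped keeps the full header line, which is the behaviour that seems to cause the error.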

Hopefully that'll be easy to implement - what do you think?

thanks again,

Janet Young


Dr. Janet Young

Malik lab http://research.fhcrc.org/malik/en.html

Fred Hutchinson Cancer Research Center 1100 Fairview Avenue N., A2-025, P.O. Box 19024, Seattle, WA 98109-1024, USA.

tel: (206) 667 4512 email: jayoung ...at... fhcrc.org