DominikBuchner / BOLDigger-commandline

BOLDigger as a commandline tool
MIT License

linebreaks in fasta file #3

Closed FabianRoger closed 2 years ago

FabianRoger commented 3 years ago

I noticed that boldigger cannot deal with linebreaks in fasta files. However, many programs create fasta files with a fixed line width because it makes them easier to display. This seems to be the case for both vsearch and writeXStringSet from the Biostrings package, to give two examples.

Do you think it would be possible for boldigger to support linebreaks? Only lines that start with a '>' should ever be names, and anything between two such lines is the sequence.

I would offer help, but my python is very rudimentary I'm afraid :-)

see also here

FabianRoger commented 3 years ago

Here is an example file (it's .txt as GitHub doesn't allow me to upload .fasta files)

test_file.txt

This file should not contain linebreaks and should work with boldigger (the sequences are COI).

Reprex in R:

library(Biostrings)

url <- "https://github.com/DominikBuchner/BOLDigger-commandline/files/6597915/original_file.txt"

download.file(url, "test_fasta.txt")

test_fasta <- readDNAStringSet("test_fasta.txt")

writeXStringSet(test_fasta, "test_fasta_2.fasta")

The new file will have linebreaks and will not work in boldigger.

Hope this helps!

DominikBuchner commented 3 years ago

Yes, I'm aware of the problem. There is the 2-line fasta format and the fasta format where a linebreak is added after every 80 bp of sequence. I'll program a workaround and connect it to the issue where the fasta file gets deleted.
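For illustration only, a workaround of this kind could look roughly like the sketch below. It is not BOLDigger's actual code: the function names `read_fasta` and `linearize_fasta` are hypothetical, and the sketch assumes that every line starting with '>' is a header and that everything up to the next header belongs to the same sequence.

```python
from typing import Iterable, Iterator, List, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs, joining wrapped sequence lines.

    Hypothetical helper, not part of BOLDigger.
    """
    header = None
    chunks: List[str] = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header = line[1:]
            chunks = []
        elif header is not None:
            # Any non-header line after a header is part of the sequence.
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def linearize_fasta(lines: Iterable[str]) -> List[str]:
    """Return the records as 2-line fasta strings (header line, sequence line)."""
    return [f">{header}\n{seq}\n" for header, seq in read_fasta(lines)]
```

An open file handle can be passed directly, for example `linearize_fasta(open("test_fasta_2.fasta"))`, and the result written out with `writelines`. A file that is already in 2-line format passes through unchanged, so the same code path would handle both variants.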

FabianRoger commented 3 years ago

sounds great!

FabianRoger commented 3 years ago

Just came across this, maybe this does what you want? https://github.com/knights-lab/BURST/blob/master/embalmlets/linfasta.c

DominikBuchner commented 3 years ago

I already know a pure Python solution, I just don't have time to program the fix.