Closed mihkelvaher closed 2 years ago
Hi, I've encountered this before, and it turned out to be a formatting error in my sequences file. I recommend pasting the code Haystac uses to parse the sequences file into a Python shell and examining the data frame it creates:
    import pandas as pd

    # Replace `value` with the path to your sequences file.
    pd.read_table(
        value,
        sep="\t",
        header=None,
        index_col=False,
    )

This is from haystac.workflow.scripts.utilities, and you should replace value with the sequences file name.
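If eyeballing the data frame is not enough, here is a rough sketch of a check built around the same pd.read_table call. The helper name check_sequences_table and the sample rows are made up for illustration and are not part of Haystac; the only thing taken from this thread is that the file should be tab-separated with 3 columns.

```python
import io

import pandas as pd


def check_sequences_table(source, expected_cols=3):
    """Read a tab-separated table and list obvious formatting problems.

    Hypothetical helper, not part of Haystac. `source` can be a file path
    or an open file-like object.
    """
    # Same call as above, but read everything as strings so nothing is coerced.
    df = pd.read_table(source, sep="\t", header=None, index_col=False, dtype=str)

    problems = []
    # A file delimited by spaces instead of tabs shows up as a single column.
    if df.shape[1] != expected_cols:
        problems.append(f"expected {expected_cols} columns, found {df.shape[1]}")

    # Rows with a missing or empty field are read as NaN in that cell.
    for idx in df[df.isna().any(axis=1)].index:
        problems.append(f"row {idx + 1} has an empty field")

    return df, problems


# Made-up sample: the second row is missing its third field.
sample = io.StringIO("sp1\tacc1\t/path/a.fasta\nsp2\tacc2\n")
df, problems = check_sequences_table(sample)
print(df)
print(problems)
```

To run it on a real file, pass the path instead of the StringIO object. Note that "row" here counts parsed rows, so blank lines in the file (which pandas skips by default) are not counted.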
Hello,
Hope you are doing great!
Unfortunately, the error message you have posted does not point to any formatting error in the file you are providing as input. At first glance it looks more like the file object itself is invalid, though I'm not sure why. I'll definitely have another look around, in case I have missed something!
Re abundance calculation: The calculated abundance is the molecular abundance of that species in the screened library. The abundance is not normalised by the genome size. If reads can be uniquely assigned to repeat regions (indeed a bit unlikely) then they are included in the abundance calculation for that species, otherwise they are assigned to the grey matter category. Please let me know if that does not make sense.
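To make the point about genome size concrete, here is a small illustrative sketch. It is not Haystac's actual calculation; the function names, read counts, and genome sizes are all invented. It only contrasts a plain proportion of assigned reads with a proportion that has been divided by genome length first.

```python
def read_proportions(read_counts):
    """Share of assigned reads per species, with no genome-size correction."""
    total = sum(read_counts.values())
    return {name: count / total for name, count in read_counts.items()}


def length_normalised_proportions(read_counts, genome_sizes):
    """Share per species after dividing each read count by genome length."""
    density = {name: read_counts[name] / genome_sizes[name] for name in read_counts}
    total = sum(density.values())
    return {name: value / total for name, value in density.items()}


# Invented numbers: species A has a genome nine times larger than species B.
counts = {"A": 900, "B": 100}
sizes = {"A": 9_000_000, "B": 1_000_000}

print(read_proportions(counts))
print(length_normalised_proportions(counts, sizes))
```

With these numbers the un-normalised shares are 0.9 and 0.1, while the length-normalised shares are equal, so a species with a larger genome looks more abundant when no correction is applied.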
Re positional copy number: No, this is not possible with the current version of the program, but we can definitely look into it for a future version!
Apologies for not being able to be more helpful. I'll keep digging regarding the sequence input file error. Thank you for your patience and comments!
Best, Antony
Hi!
I just started with haystac, and after spending far too long on installation (mamba/conda had an issue with ggplot2) I managed to get it working by cloning the git repo.
I can't get through this though:
As the docs state, there should be 3 columns in the --sequences-file. Keeping it really simple still produces the error.
Any suggestions?
Also, how is abundance calculated? More specifically, is it normalized by the genome/reference size? Are repeats accounted for in any way? And a bit of a stretch: is there a way to add a positional copy number to the reference, for example if the reference is made of concatenated contigs and some of the contigs come from repeats (but are represented only once in the concatenated reference)?