antonisdim / haystac

Code repository for the HAYSTAC pipeline
MIT License

haystac database: error: argument --sequences-file: invalid #12

Closed: mihkelvaher closed this issue 2 years ago

mihkelvaher commented 3 years ago

Hi!

I just started with haystac, and after spending far too long on the installation (mamba/conda had an issue with ggplot2) I managed to get it working by cloning the git repo.

I can't get past this step, though:

haystac database --mode build --sequences-file sequences_file --output  haystacDB --cores 40 -mem 148000

haystac database: error: argument --sequences-file: invalid <haystac.workflow.scripts.utilities.SequenceFileType object at 0x2b51f4a93040> value: 'sequences_file'

As the docs state, the --sequences-file should have 3 columns. Keeping it really simple:

head -1 sequences_file 
85172   85172   mypath/references/85172.fasta

still produces the error.

Any suggestions?

Also, how is abundance calculated? More specifically, is it normalized by the genome/reference size? Are repeats accounted for in any way? And, a bit of a stretch: is there a way to add a positional copy number to the reference? For example, the reference might be a set of concatenated contigs where some contigs come from repeats but are represented only once in the concatenated reference.

Pkaps25 commented 3 years ago

Hi, I've encountered this before, and it turned out to be a formatting error in my sequences file. I recommend pasting the code haystac uses to parse the sequences file into a Python shell and examining the data frame it creates.

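import pandas as pd

# Parse the sequences file as tab-separated text with no header row.
df = pd.read_table(
    value,  # replace with the path to your sequences file
    sep="\t",
    header=None,
    index_col=False,
)
print(df)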

This is from haystac.workflow.scripts.utilities; replace value with the path to your sequences file.
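One thing such a check can reveal is a file separated by spaces rather than tabs: the parsing call above splits on "\t" only, so a space-separated line is read as a single column. The sketch below is a standard-library illustration of that effect, not haystac code. The helper name column_counts and the two throwaway files are hypothetical, and whether this was the cause of the error in this issue is not known.

```python
import tempfile
from pathlib import Path


def column_counts(path, sep="\t"):
    """Return the number of sep-delimited fields on each non-empty line."""
    with open(path) as handle:
        return [len(line.rstrip("\n").split(sep)) for line in handle if line.strip()]


# Demo on two throwaway files: one tab-separated, one space-separated.
with tempfile.TemporaryDirectory() as tmp:
    tabs = Path(tmp, "tabs.tsv")
    spaces = Path(tmp, "spaces.tsv")
    tabs.write_text("85172\t85172\tmypath/references/85172.fasta\n")
    spaces.write_text("85172   85172   mypath/references/85172.fasta\n")
    tab_counts = column_counts(tabs)
    space_counts = column_counts(spaces)

print(tab_counts)    # [3]
print(space_counts)  # [1]
```

Running column_counts on the real sequences file should give 3 for every line if the file is tab-separated as the docs describe.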

antonisdim commented 3 years ago

Hello,

Hope you are doing great!

Unfortunately, the error message you posted does not point to any error in the formatting of the file you are providing as input. At first glance it looks more like the file object itself is invalid, though I am not sure why. I'll definitely have another look around, in case I have missed something!
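Some background on why the message is so uninformative: argparse prints "invalid <type> value" whenever the callable passed as type= raises ValueError or TypeError, and it falls back to the callable's repr when that callable has no __name__, which is the case for a class instance. Only argparse.ArgumentTypeError has its own message shown to the user. The sketch below demonstrates this with a hypothetical validator, ThreeColumnFile, which is not haystac's SequenceFileType; whether haystac's validator raises a ValueError or TypeError internally is an assumption and has not been checked here.

```python
import argparse
import contextlib
import io


class ThreeColumnFile:
    """Hypothetical validator; NOT haystac's SequenceFileType."""

    def __init__(self, error_cls):
        self.error_cls = error_cls

    def __call__(self, value):
        # Always reject the value, so only the error reporting differs.
        raise self.error_cls(f"'{value}' must have 3 tab-separated columns")


def parse_error(error_cls):
    """Return the final line argparse writes to stderr for a rejected value."""
    parser = argparse.ArgumentParser(prog="demo")
    parser.add_argument("--sequences-file", type=ThreeColumnFile(error_cls))
    stderr = io.StringIO()
    with contextlib.redirect_stderr(stderr):
        try:
            parser.parse_args(["--sequences-file", "sequences_file"])
        except SystemExit:
            pass  # argparse exits after reporting the error
    return stderr.getvalue().strip().splitlines()[-1]


opaque = parse_error(ValueError)
helpful = parse_error(argparse.ArgumentTypeError)
print(opaque)   # generic "invalid <... object at 0x...> value" message
print(helpful)  # the validator's own message
```

If this is what is happening, the underlying exception carries the real cause, so calling the validator directly in a Python shell, outside argparse, should show the original traceback.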

Re abundance calculation: The calculated abundance is the molecular abundance of that species in the screened library. The abundance is not normalised by the genome size. If reads can be uniquely assigned to repeat regions (indeed a bit unlikely) then they are included in the abundance calculation for that species, otherwise they are assigned to the grey matter category. Please let me know if that does not make sense.
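To make the "not normalised by the genome size" point concrete, here is a minimal sketch of what a post-hoc genome-size normalisation could look like. It is not something haystac does. The function name, the taxa, the read fractions and the genome sizes are all hypothetical, and dividing by genome length is only one possible convention.

```python
def normalise_by_genome_size(read_fractions, genome_sizes):
    """Rescale per-taxon read fractions by genome length so they sum to 1."""
    per_base = {t: read_fractions[t] / genome_sizes[t] for t in read_fractions}
    total = sum(per_base.values())
    return {t: v / total for t, v in per_base.items()}


# Hypothetical input: both taxa get half the reads,
# but taxon_b's genome is four times larger.
read_fractions = {"taxon_a": 0.5, "taxon_b": 0.5}
genome_sizes = {"taxon_a": 1_000_000, "taxon_b": 4_000_000}

normalised = normalise_by_genome_size(read_fractions, genome_sizes)
print(normalised)
```

With these made-up numbers, equal read fractions become roughly 0.8 and 0.2 after normalisation, which shows how much the two conventions can differ when genome sizes are unequal.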

Re positional copy number: No, this is not possible with the current version of the program, but we can definitely look into it for a future version!

Apologies for not being able to be more helpful. I'll keep digging into the sequences file error. Thank you for your patience and comments!

Best, Antony