fmfi-compbio / warpstr

Determining tandem repeat lengths using raw nanopore signals.
https://fmfi-compbio.github.io/warpstr/

Overview file missing #1

Closed sabiqali closed 11 months ago

sabiqali commented 1 year ago

Hi,

I was trying to get set up on warpstr and use it to analyze some loci that we do have. Having installed it and updated the config file as asked in the README, I ran into some errors while running the software.

The error statement has been pasted below:

2022-11-07 11:40:40 Processing 1 of 1 Locus name: c9orf72
Flank length not set for locus - using default value of 110
Sequence was not set for c9orf72. Automatic configuration defined sequence as: CCCC(GGCCCC)GG derived from reference sequence CCCC(GGCCCC)[2]GG
Not found the overview file /.mounts/labs/simpsonlab/users/schaudhary/projects/2022.10.STRr10toolkit/warpstr/output_folder/c9orf72/overview.csv - Please check the "output" in config

It then errors out with FileNotFoundError: [Errno 2] No such file or directory:

Would you be able to tell me why the overview file is not being generated? The input in question is a cell line that contains the locus and has been prepped using Cas9. I have also mentioned the output folder where I would like all the output files to be generated.

xsitarcik commented 1 year ago

Hi,

the tool runs in multiple steps, as given in the config (https://github.com/fmfi-compbio/warpstr/blob/bb1b0a62f89d00ff7ec72ac98b24e2b7d68e8d81/example/config.yaml#L12-L18). The first step, single_read_extraction, extracts all .fast5 files from the input data paths, stores them in the output folder, and generates the overview file. I see that the template config incorrectly sets this step to False. Sorry. Is this step set to True in your config? When running the tool for the first time, the flags for these steps should be set to True.

Please check whether that solves the problem or whether the errors persist. If they persist, please provide your full config file. I will be glad to help.
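A minimal sketch of how the step flags could be checked before a first run. It assumes the config has already been parsed into a Python dict and that the flags are top-level keys; only single_read_extraction and tr_region_extraction are named in this thread, so the tuple below is not a complete list of the steps, and the helper name disabled_steps is hypothetical.

```python
# Step flags mentioned in this thread. The real config may define more steps;
# this tuple is an assumption made for illustration.
FIRST_RUN_STEPS = ("single_read_extraction", "tr_region_extraction")


def disabled_steps(config: dict) -> list:
    """Return the first-run step flags that are missing or not set to True."""
    return [name for name in FIRST_RUN_STEPS if config.get(name) is not True]


# Hypothetical parsed config resembling the situation described above.
example_config = {"single_read_extraction": False, "tr_region_extraction": True}
for step in disabled_steps(example_config):
    print(f"Step '{step}' is not enabled; set it to True for the first run.")
```

An empty result means both flags named here are enabled.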

sabiqali commented 1 year ago

Hi @xsitarcik,

That seems to have solved the issue. The overview.csv is now being generated.

However, the program does not exit gracefully. It fails with the following error:

AttributeError: 'Pandas' object has no attribute 'saved'

Further, I did have a question about one of the fields in the config file, which did not have a comment on it to describe it. Could you tell me what this line is supposed to be? I just want to make sure that my config file is completely correct. Is it just the number of repeats expected in the reference? Thank you!

https://github.com/fmfi-compbio/warpstr/blob/bb1b0a62f89d00ff7ec72ac98b24e2b7d68e8d81/example/config.yaml#L34

xsitarcik commented 1 year ago

Hi, saved is a boolean attribute in the overview file that denotes whether the locus was found in a particular read. Is the tr_region_extraction flag in the config set to True? It must be set to True so that the tool first localizes the repeats in the reads and saves this information in the overview file.
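For context, 'Pandas' is the default name of the row objects produced by pandas' DataFrame.itertuples(), so the message is consistent with rows of the overview file being read while the saved column is absent. A small sketch, assuming overview.csv is comma-separated with a header row, that reports whether the column is present; the function name summarize_overview is hypothetical and is not part of warpstr.

```python
import csv


def summarize_overview(path: str) -> dict:
    """Report the row count of an overview file and whether it has a 'saved' column."""
    with open(path, newline="") as handle:
        reader = csv.DictReader(handle)  # assumes a header row and comma delimiter
        columns = reader.fieldnames or []
        row_count = sum(1 for _ in reader)
    return {"rows": row_count, "has_saved_column": "saved" in columns}
```

Calling summarize_overview on the overview.csv path from the log shows whether any reads were recorded and whether the repeat localization step has written the saved column yet.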

If the tr_region_extraction flag was set to True and the error persists, then please ensure that:

  1. reads are found by the tool (i.e. see the overview file to check if there are any rows with read info)
  2. coord in the config is set correctly and corresponds to the BAM mapping files; otherwise, no reads are found in the extraction phase. The correspondence between coord and the BAM must be exact, including the region names. For example, if coord is set to region chr1 but the BAM contains no region with that name (usually because the regions are named differently), no reads are selected for the repeat extraction phase. In that case, check the BAM files to see how the regions are named, and rename the coord region accordingly.
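The region-name check in the second point can be sketched as follows. This assumes coord has the form name:start-end and that the list of region names in the BAM has been obtained separately, for example from the @SQ lines printed by samtools view -H or from the references attribute of a pysam AlignmentFile. The function name and the example coordinates are hypothetical.

```python
from typing import Iterable, Optional, Tuple


def check_coord_region(coord: str, bam_regions: Iterable[str]) -> Tuple[bool, Optional[str]]:
    """Return (found, suggestion) for the region name used in coord.

    found is True when the region name appears in bam_regions exactly.
    suggestion is the same name with the 'chr' prefix added or removed,
    but only if that variant is present in bam_regions.
    """
    regions = set(bam_regions)
    region = coord.split(":", 1)[0]  # assumes 'name:start-end'
    if region in regions:
        return True, None
    variant = region[3:] if region.startswith("chr") else "chr" + region
    return False, variant if variant in regions else None


# Hypothetical example: the BAM names regions '1', '2', ... without 'chr'.
found, suggestion = check_coord_region("chr1:1000-2000", ["1", "2", "3"])
if not found:
    print(f"Region not found in BAM; closest match: {suggestion}")
```

Only the common chr-prefix mismatch is handled here; other naming schemes need a manual look at the BAM header.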

As for the noting field in the config, your assumption is correct: the field denotes the reference locus, usually in some concise representation. It serves no purpose other than as supplementary information for evaluating and comparing against the locus sequences predicted by the tool.
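To illustrate what such a concise representation encodes, here is a sketch that expands the notation seen in the log message earlier in this thread, CCCC(GGCCCC)[2]GG, into a plain sequence. The grammar is an assumption inferred from that single example: a parenthesized unit followed by an optional [count], surrounded by plain bases. The actual noting field may follow a different convention, and expand_notation is a hypothetical helper.

```python
import re

# A parenthesized unit with an optional [count], or a run of plain bases.
TOKEN = re.compile(r"\(([ACGT]+)\)(?:\[(\d+)\])?|([ACGT]+)")


def expand_notation(notation: str) -> str:
    """Expand e.g. 'CCCC(GGCCCC)[2]GG' into the full nucleotide sequence."""
    parts = []
    position = 0
    for match in TOKEN.finditer(notation):
        if match.start() != position:
            raise ValueError(f"Unexpected character at position {position}")
        unit, count, plain = match.groups()
        if plain is not None:
            parts.append(plain)
        else:
            # A unit without an explicit count is treated as one copy (assumption).
            parts.append(unit * int(count or 1))
        position = match.end()
    if position != len(notation):
        raise ValueError(f"Unexpected character at position {position}")
    return "".join(parts)


print(expand_notation("CCCC(GGCCCC)[2]GG"))
```

Expanding the reference notation this way makes it straightforward to compare its length or repeat count against a sequence predicted by the tool.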