About the BAM file - Githubissues

hasindu2008 commented 1 year ago

Hi,

I was going to try warpstr but while reading through the readme I got a few questions. Sorry if I overlooked something which is already in the documentation. I am trying to understand the input structure. Should the input be in single-fast5 or multi-fast5? And this BAM file which is needed as the input, is it the output from guppy that includes the move table? It would be great if a small test data is provided as an example.

xsitarcik commented 1 year ago

Hi @hasindu2008, thank you very much for questions.

Should the input be in single-fast5 or multi-fast5?

Input .fast5 files should be multi-fast5 files, if that is a problem see single_to_multi_fast5 command about converting.

and this BAM file which is needed as the input, is it the output from guppy that includes the move table?

Here, the BAM file is the output of aligning basecalled reads to the reference, for example using minimap2. WarpSTR uses the BAM file to find reads that encompass the given STR locus and extract associated .fast5 files in single-fast5 format. The BAM file does not include the move table, but the move table is produced directly in WarpSTR after the above mentioned step of extraction of .fast5 files. Extracted files are basecalled again in WarpSTR using Guppy with --fast5_out flag, which produces the move table - this table is stored in .fast5 files.

It would be great if a small test data is provided as an example.

Thank you for your suggestion of a small test data. We pushed a new version of WarpSTR containing a test data with 10 reads for one locus. Please pull the new version using git pull and try it out. Input test data are stored in test/test_input and config file is test/config.yaml. In the test data directory you can see that there is a BAM file (with index .bam.bai file) and .fast5 files. No more input files are required for WarpSTR, and also it is not required for those files to be in any particular structure, they just need to have the correct extension and be located in any depth of the input directory (as given in the config). Sadly, the test case is not completely independent, as it is required to provide a path to the reference genome and path to Guppy basecaller. You can try the test case using bash run_test_case.sh when in WarpSTR directory (and with activated conda environment), which will prompt you to provide the required paths and run the WarpSTR for you. Output files will be then stored in test/test_output/ as given in the config file.

If you have any further questions, do not hesitate to ask, I will be glad to help.

EDIT: We have also updated our documentation page. See TestCase section

hasindu2008 commented 1 year ago

Thank you for adding this test case that will be very helpful - I will test it when I manage to fix the conda issues.

The reason I asked about the move table in BAM file - is from recent versions Guppy has started outputting the move table in an unaligned SAM/BAM. Not sure if that --fast5_out still exists.

How long would it take to run this on a whole PromethION data set?

xsitarcik commented 1 year ago

Hi @hasindu2008, thank you very much for the discussion.

The reason I asked about the move table in BAM file - is from recent versions Guppy has started outputting the move table in an unaligned SAM/BAM. Not sure if that --fast5_out still exists.

We have no experience with newer versions of Guppy and the move table as BAM files. In WarpSTR, Guppy is used only to obtain an approximate mapping between signal and sequence, in order to obtain the signal representing the repeating region. The basecalled sequence is used to locally align flanking sequences, where we assume that flanking sequences are long enough (110 defaultly) to properly align them even with older and more erroneous basecaller versions (our experiments support this assumption). Furthermore, basecalled sequences are not used at all in WarpSTR for STR characterization (apart from comparison), so obtaining more accurate sequence it is not relevant there. Therefore we think that it is simpler to just use older version of Guppy that supports the --fast5_out option, if that is possible. However, thanks for the tip, we give BAM files a try in the future.

How long would it take to run this on a whole PromethION data set?

I just ran WarpSTR for one locus and a data set with ~52x coverage level. The total running time was ~45 minutes and I used only 2 threads and CPU version of guppy (ont-guppy-cpu-4.0.15, to be precise). The locus was found in 49 reads. It took up to 10 minutes to extract single fast5s, 30 minutes for Guppy to annotate fast5s, and around 5 minutes to delineate TR signals and characterize the locus.

hasindu2008 commented 1 year ago

Here is how you get the mv table

guppy_basecaller -c model  -i input/ --device cuda:0  --moves_out --bam_out  --save_path out/
samtools cat --threads 8 -o merged.bam out/*.bam
samtools fastq -T ts,ns,mv merged.bam | minimap2 -ax map-ont -y ref.fa - | samtools view -Sb - | samtools sort - > merged_aln.bam

This will give a single aligned BAM file for the dataset. You can query a particular genomic region using samtools/htslib and directly access the ts,ns and mv tags in each bam record which contain the move table for each read. For these read IDs then you can grab the corresponding raw signals, the easiest and fastest way is using BLOW5:

slow5tools f2s fast5_dir/  -d slow5_tmp/
slow5tools merge slow5_tmp/ -o merged.blow5
slow5tools index merged.blow5

Now you can use the slow5tools get or get_read function in pyslow5 to grab the signal corresponding to each read ID.

This method will scale well for doing the analysis for all str loci genome-wide for PromethION data. The added advantage is, that a directory structure is no longer needed.

If you only need the approximate location, you can even get rid of the Guppy dependency and use the f5c resquiggle or eventalign modules.

I am still having trouble with this conda. Can you please point me to the script that performs the actual str calling - i might try running manually?

xsitarcik commented 1 year ago

BLOW5 looks great, I wish that we knew the tool 2 years ago when we started developing WarpSTR.

What kind of problem do you have with conda installation?

We can prepare a script that will do only the actual STR calling, but it will take some time, currently it is part of the whole tool and has many dependencies to run independently and as is; while it also relies on the directory structure (it was however desirable to us as it helps when reanalyzing one locus, it is then possible to reanalyze only from a particular step if set in config.yaml). If a standalone script would be made, it would however require as input both, a raw signal and the approximate position of STR in it, as we first normalize the signal as a whole, and then we extract the signal part corresponding to the STR, which we work with.

tvinar commented 1 year ago

Cannot you just document the directory structure and put a particular reanalysis step in config.yaml? (or a have a script simply organize things as expected and run the master script with a corresponding restart option?)

Tomas Vinar, Associate Professor Department of Applied Informatics Faculty of Mathematics, Physics, and Informatics Comenius University, Bratislava E-mail: @.*** Office: M163 Work Phone: +421-2-60295207

On Thu, Dec 1, 2022 at 3:26 PM xsitarcik @.***> wrote:

BLOW5 looks great, I wish that we knew the tool 2 years ago when we started developing WarpSTR.

What kind of problem do you have with conda installation?

We can prepare a script that will do only the actual STR calling, but it will take some time, currently it is part of the whole tool and has many dependencies to run independently and as is; while it also relies on the directory structure (it was however desirable to us as it helps when reanalyzing one locus, it is then possible to reanalyze only from a particular step if set in config.yaml). If a standalone script would be made, it would however require as input both, a raw signal and the approximate position of STR in it, as we first normalize the signal as a whole, and then we extract the signal part corresponding to the STR, which we work with.

— Reply to this email directly, view it on GitHub https://github.com/fmfi-compbio/warpstr/issues/2#issuecomment-1333847101, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAV5KY47QBDQJP2BKCQ53NDWLCYP7ANCNFSM6AAAAAASLAGDRE . You are receiving this because you are subscribed to this thread.Message ID: @.***>

xsitarcik commented 1 year ago

Hi @hasindu2008, we just pushed a new release of WarpSTR with simplified conda requirements and another option of installing WarpSTR using pipenv. We replaced some dependencies by our own code or already used dependencies, we hope it will be easier to install now.

Further, we prepared a script simulating the output of previous steps so then WarpSTR can be called just for STR calling, as @tvinar proposed (much appreciated). See Skipping extraction of STR regions in the readme file. The script accepts .csv file with rows representing reads, where you need to specify locus associated with the read, path to the .fast5 read, and approx. signal positions corresponding to the start and end of the flanked STR region. If you want to skip just the single fast5 extraction, see Skipping extraction of locus reads in the readme file.

fmfi-compbio / warpstr

About the BAM file #2

Cannot you just document the directory structure and put a particular reanalysis step in config.yaml? (or a have a script simply organize things as expected and run the master script with a corresponding restart option?)