fmfi-compbio / warpstr

Determining tandem repeat lengths using raw nanopore signals.
https://fmfi-compbio.github.io/warpstr/
Other
11 stars 1 forks source link

can't open directory: /work/software/anaconda3/envs/warpstr/lib/hdf5/plugin #7

Open Slowlysun opened 10 months ago

Slowlysun commented 10 months ago

When I check whether WarpSTR works correctly by running: bash run_test_case.sh , I get this traceback:

ONT Guppy basecalling software version 6.0.6+8a98bbc config file: /work/software/guppy/ont-guppy-cpu/data/dna_r9.4.1_450bps_hac.cfg model file: /work/software/guppy/ont-guppy-cpu/data/template_r9.4.1_450bps_hac.jsn input path: test/test_output/Human_STR_1108232/fast5/test_run1 save path: test/test_output/Human_STR_1108232/fast5/test_run1/aux chunk size: 2000 chunks per runner: 256 minimum qscore: 9 records per file: 4000 num basecallers: 1 cpu mode: ON threads per caller: 2 Found 10 fast5 files to process. Init time: 333 ms

0% 10 20 30 40 50 60 70 80 90 100% |----|----|----|----|----|----|----|----|----|----|


Caller time: 172533 ms, Samples called: 1468186, samples/s: 8509.59 Finishing up any open output files. Basecalling completed successfully. find test/test_output/Human_STR_1108232/fast5/test_run1/aux/workspace -maxdepth 1 -name *.fast5 -print0 | xargs -0 cp -t test/test_output/Human_STR_1108232/fast5/test_run1/annot; qt.qpa.xcb: X server does not support XInput 2 failed to get the current screen resources DEBUG_3_TREX: Extracting TRs - running with 2 threads Traceback (most recent call last): File "WarpSTR.py", line 91, in run() File "WarpSTR.py", line 71, in run overview_df, collapsed_df = main_wrapper(locus, main_config.threads) File "/work/software/warpstr/src/caller/wrapper.py", line 20, in main_wrapper workload = get_workload(df_overview, locus.path) File "/work/software/warpstr/src/caller/wrapper.py", line 52, in get_workload norm_signal = fast5.get_data_processed((row.l_start_raw, row.r_end_raw)) File "/work/software/warpstr/src/schemas/fast5.py", line 52, in get_data_processed self.handle['Raw']['Reads'][rname]['Signal']) File "/work/software/anaconda3/envs/warpstr/lib/python3.7/site-packages/numpy/core/_asarray.py", line 102, in asarray return array(a, dtype, copy=False, order=order) File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper File "/work/software/anaconda3/envs/warpstr/lib/python3.7/site-packages/h5py/_hl/dataset.py", line 1005, in array self.read_direct(arr) File "/work/software/anaconda3/envs/warpstr/lib/python3.7/site-packages/h5py/_hl/dataset.py", line 966, in read_direct self.id.read(mspace, fspace, dest, dxpl=self._dxpl) File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper File "h5py/h5d.pyx", line 192, in h5py.h5d.DatasetID.read File "h5py/_proxy.pyx", line 112, in h5py._proxy.dset_rw OSError: Can't read data (can't open directory: /work/software/anaconda3/envs/warpstr/lib/hdf5/plugin)

This looks like the software is missing files.

xsitarcik commented 10 months ago

This looks like some OS misconfiguration. It looks for the hdf5 plugin on /work/software/anaconda3/envs/warpstr/lib/hdf5/plugin path but it looks like that it does not exist there. Can you check the path if it really exists? Also, what is your operating system? Have you installed WarpSTR using conda?

Slowlysun commented 10 months ago

I installed WarpSTR using conda. /work/software/anaconda3/envs/warpstr/lib/hdf5/plugin path is not exist, even I can't find /work/software/anaconda3/envs/warpstr/lib/hdf5 path

This looks like some OS misconfiguration. It looks for the hdf5 plugin on /work/software/anaconda3/envs/warpstr/lib/hdf5/plugin path but it looks like that it does not exist there. Can you check the path if it really exists? Also, what is your operating system? Have you installed WarpSTR using conda?

xsitarcik commented 10 months ago

I would try to activate the installed warpstr conda environment conda activate warpstr and there install hdf5 again conda install hdf5. If conda tells you that the requested packages are already installed that OS probably did some interference. Without knowing how to reproduce your problem it will be hard to help. Can you provide more information? What is your OS? Conda version? etc.

Slowlysun commented 10 months ago
  1. My operating system is Ubuntu 18.04.6;
  2. Conda version is 23.1.0;
  3. I run conda activate warpstr , and then conda install hdf5, but it still get that wrong:

DEBUG_2_EXT: found these bamfiles [('test/test_input/test_run1', 'test/test_input/test_run1/mapping/mapping.bam', 'chr4:183178378-183178421', 'test_run1', 'test_run1', 'test/test_output/Human_STR_1108232/fast5')] in test/test_input DEBUG_2_EXT: Running with 2 DEBUG_2_EXT: There are 10 in test/test_input/test_run1 DEBUG_2_EXT: All raw fast5 files have been successfully extracted calling script /work/software/guppy/ont-guppy-cpu/bin/guppy_basecaller --input_path test/test_output/Human_STR_1108232/fast5/test_run1 --save_path test/test_output/Human_STR_1108232/fast5/test_run1/aux --flowcell FLO-MIN106 --kit SQK-LSK109 --fast5_out --cpu_threads_per_caller 2 ONT Guppy basecalling software version 6.0.6+8a98bbc config file: /work/software/guppy/ont-guppy-cpu/data/dna_r9.4.1_450bps_hac.cfg model file: /work/software/guppy/ont-guppy-cpu/data/template_r9.4.1_450bps_hac.jsn input path: test/test_output/Human_STR_1108232/fast5/test_run1 save path: test/test_output/Human_STR_1108232/fast5/test_run1/aux chunk size: 2000 chunks per runner: 256 minimum qscore: 9 records per file: 4000 num basecallers: 1 cpu mode: ON threads per caller: 2 Found 10 fast5 files to process. Init time: 399 ms

0% 10 20 30 40 50 60 70 80 90 100% |----|----|----|----|----|----|----|----|----|----|


Caller time: 137743 ms, Samples called: 1468186, samples/s: 10658.9 Finishing up any open output files. Basecalling completed successfully. find test/test_output/Human_STR_1108232/fast5/test_run1/aux/workspace -maxdepth 1 -name *.fast5 -print0 | xargs -0 cp -t test/test_output/Human_STR_1108232/fast5/test_run1/annot; qt.qpa.xcb: X server does not support XInput 2 failed to get the current screen resources DEBUG_3_TREX: Extracting TRs - running with 2 threads Traceback (most recent call last): File "WarpSTR.py", line 91, in run() File "WarpSTR.py", line 71, in run overview_df, collapsed_df = main_wrapper(locus, main_config.threads) File "/work/software/warpstr/src/caller/wrapper.py", line 20, in main_wrapper workload = get_workload(df_overview, locus.path) File "/work/software/warpstr/src/caller/wrapper.py", line 52, in get_workload norm_signal = fast5.get_data_processed((row.l_start_raw, row.r_end_raw)) File "/work/software/warpstr/src/schemas/fast5.py", line 52, in get_data_processed self.handle['Raw']['Reads'][rname]['Signal']) File "/work/software/anaconda3/envs/warpstr/lib/python3.7/site-packages/numpy/core/_asarray.py", line 102, in asarray return array(a, dtype, copy=False, order=order) File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper File "/work/software/anaconda3/envs/warpstr/lib/python3.7/site-packages/h5py/_hl/dataset.py", line 1005, in array self.read_direct(arr) File "/work/software/anaconda3/envs/warpstr/lib/python3.7/site-packages/h5py/_hl/dataset.py", line 966, in read_direct self.id.read(mspace, fspace, dest, dxpl=self._dxpl) File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper File "h5py/h5d.pyx", line 192, in h5py.h5d.DatasetID.read File "h5py/_proxy.pyx", line 112, in h5py._proxy.dset_rw OSError: Can't read data (can't open directory: /work/software/anaconda3/envs/warpstr/lib/hdf5/plugin)

xsitarcik commented 10 months ago

I just checked and I did not have any problems using mentioned versions. However I used older guppy version, maybe that is the problem. Have you used guppy of that version before? Do you had any problems? I do not have experience with newer guppy versions but maybe the problem is that it produces .fast5 files that are VBZ compressed, and thus requiring decompressing plugin. So I would try to install that plugin.

I see that conda now offers a package for that, so this should be easier option. Just activate the warpstr environment and run: conda install ont_vbz_hdf_plugin -c bioconda, this should install the plugin at /work/software/anaconda3/envs/warpstr/hdf5/lib/plugin/ and also properly set the env variable HDF5_PLUGIN_PATH to that path. Please verify by echo $HDF5_PLUGIN_PATH if it was set correctly, or modify it to match the path where plugin was installed.

The other option is to download the plugin from https://github.com/nanoporetech/vbz_compression/releases, install it somewhere and modify the $HDF5_PLUGIN_PATH to point to the directory where you install it. Or you can just create the missing directory /work/software/anaconda3/envs/warpstr/lib/hdf5/plugin and install the plugin there.

Slowlysun commented 10 months ago

Thank you very much, I can successfully run the scritp after I installed plugin. But when I set ‘motif’ instead of ‘sequence’, there is still an error.

.......................... Finishing up any open output files. Basecalling completed successfully. find test/test_output/Human_STR_1108232/fast5/test_run1/aux/workspace -maxdepth 1 -name *.fast5 -print0 | xargs -0 cp -t test/test_output/Human_STR_1108232/fast5/test_run1/annot; Finished Guppy basecalling for test/test_output/Human_STR_1108232/fast5/test_run1 qt.qpa.xcb: X server does not support XInput 2 failed to get the current screen resources DEBUG_3_TREX: Extracting TRs - running with 16 threads Traceback (most recent call last): File "WarpSTR.py", line 91, in run() File "WarpSTR.py", line 71, in run overview_df, collapsed_df = main_wrapper(locus, main_config.threads) File "/work/software/warpstr/src/caller/wrapper.py", line 22, in main_wrapper call_wrapper = CallerWrapper(locus, threads) File "/work/software/warpstr/src/caller/wrapper.py", line 66, in init template_seq, reverse_seq = self.get_seqs(locus.sequence, load_flanks(locus.path)) File "/work/software/warpstr/src/caller/wrapper.py", line 75, in get_seqs rev = flanks.reverse.left + self.reverse_uniq_sequence(sequence) + flanks.reverse.right File "/work/software/warpstr/src/caller/wrapper.py", line 83, in reverse_uniq_sequence rev += tmpl.ENCODING_DICT[i] KeyError: 't'

xsitarcik commented 10 months ago

Looks like that you specified lowercase characters in motif, there must be only uppercase characters (at least in this version)

Slowlysun commented 10 months ago

I used the config file of the test(config_test.yaml), but still reported an error.

loci:

xsitarcik commented 10 months ago

When a motif is provided instead of sequence, WarpSTR produces the sequence automatically from the reference file using locus coordinates. In your case reference file contained lowercase characters somewhere in that locus region. We did not test such a case so we missed this. Thank you. We released a bugfix and now it should be working with lowercase letters found in the reference fasta. So please run git pull in the cloned repository and run the test again.

Slowlysun commented 10 months ago

After run git pull, I run python WarpSTR.py test/config_test.yaml, but I get another error:

................................. Finished Guppy basecalling for test/test_output/Human_STR_1108232/fast5/test_run1 qt.qpa.xcb: X server does not support XInput 2 failed to get the current screen resources DEBUG_3_TREX: Extracting TRs - running with 16 threads multiprocessing.pool.RemoteTraceback: """ Traceback (most recent call last): File "/work/software/anaconda3/envs/warpstr/lib/python3.7/multiprocessing/pool.py", line 121, in worker result = (True, func(*args, *kwds)) File "/work/software/anaconda3/envs/warpstr/lib/python3.7/multiprocessing/pool.py", line 44, in mapstar return list(map(args)) File "/work/software/warpstr/src/caller/wrapper.py", line 289, in warpstr_call_parallel return warpstr.run(input_data.signal) File "/work/software/warpstr/src/caller/caller.py", line 133, in run rescaled_signal2 = rescale_signal(rescaled_signal, resc_alignment) File "/work/software/warpstr/src/caller/caller.py", line 311, in rescale_signal spline = interpolate.splrep(filt_val, filt_exp, s=len(filt_val)) File "/work/software/anaconda3/envs/warpstr/lib/python3.7/site-packages/scipy/interpolate/fitpack.py", line 291, in splrep res = _impl.splrep(x, y, w, xb, xe, k, task, s, t, full_output, per, quiet) File "/work/software/anaconda3/envs/warpstr/lib/python3.7/site-packages/scipy/interpolate/_fitpack_impl.py", line 461, in splrep raise TypeError('m > k must hold') TypeError: m > k must hold """

The above exception was the direct cause of the following exception:

Traceback (most recent call last): File "WarpSTR.py", line 93, in run() File "WarpSTR.py", line 73, in run overview_df, collapsed_df = main_wrapper(locus, main_config.threads) File "/work/software/warpstr/src/caller/wrapper.py", line 23, in main_wrapper results = call_wrapper.run(workload) File "/work/software/warpstr/src/caller/wrapper.py", line 109, in run results = pool.map(warpstr_call_parallel, workload) File "/work/software/anaconda3/envs/warpstr/lib/python3.7/multiprocessing/pool.py", line 268, in map return self._map_async(func, iterable, mapstar, chunksize).get() File "/work/software/anaconda3/envs/warpstr/lib/python3.7/multiprocessing/pool.py", line 657, in get raise self._value TypeError: m > k must hold

xsitarcik commented 10 months ago

Was this error produced when you set the motif in the config? Further, when you used the sequence in the configuration file have you run into this error? If you used motif in the config, you can check the automatically produced sequence. There is a sequence.txt file in the test_output directory, the relative path from the cloned repo should be test/test_output/Human_STR_1108232/sequence.txt. Please verify that the automatically produced sequence is correct, it should be (AAAT).

Slowlysun commented 10 months ago

When I used the sequence in the configuration file, it works fine. But when I used the motif, I got that error. When I used motif in the config, I can find test/test_output/Human_STR_1108232/sequence.txt.

$ cat test/test_output/Human_STR_1108232/sequence.txt AAATAAATAAATAAATAAATAAATAAATAAATAAATAAATAAAT

xsitarcik commented 10 months ago

Well that is the problem, it should be (AAAT), so it means WarpSTR generated sequence incorrectly from the motif and reference locus. We released another bugfix addressing the issue when the reference fasta file is of a mix of upper and lowercase characters. Please run git pull again and rerun the test. I would also thank you for your patience and for providing feedback. Really appreciated.

Slowlysun commented 10 months ago

I run git pull and return the test, it works well. Then I run my own data, there are AAATG and AAATA two motifs, AAATA can get result, and the file of sequence.txt is (AAATA)AAAT, but AAATG still reports the same error, and the file of sequence.txt is AAATAAAATAAAATAAAATAAAATAAAATAAAATAAAATAAAATAAAATAAAATAAAATAAAATAAAAT. Can this software distinguish between two motifs that are similar?

xsitarcik commented 10 months ago

These two motifs that you mention are at one locus? or it is two loci, each with its own motif? Please provide the loci configuration so I can verify if it is ok.

Slowlysun commented 10 months ago

I set the same coord (chr8:119379052-119379120), but the motif of AAATA has results, AAATG reports an error.

####### the file of config.yaml loci:

In fact, the two motifs should be in different regions, AAATG is located at chr8:119379052-119379055 and AAAAT is located at chr8:119379056-119379120, Then I modify the config.yaml file. I get an error. As you said earlier, the file sequence.txt is not the same as the set motif.

$ cat sequence.txt AAAT

######## the file of config.yaml loci:

######### error infomation failed to get the current screen resources DEBUG_3_TREX: Extracting TRs - running with 16 threads multiprocessing.pool.RemoteTraceback: """ Traceback (most recent call last): File "/work/software/anaconda3/envs/warpstr/lib/python3.7/multiprocessing/pool.py", line 121, in worker result = (True, func(*args, *kwds)) File "/work/software/anaconda3/envs/warpstr/lib/python3.7/multiprocessing/pool.py", line 44, in mapstar return list(map(args)) File "/work/software/warpstr/src/caller/wrapper.py", line 289, in warpstr_call_parallel return warpstr.run(input_data.signal) File "/work/software/warpstr/src/caller/caller.py", line 133, in run rescaled_signal2 = rescale_signal(rescaled_signal, resc_alignment) File "/work/software/warpstr/src/caller/caller.py", line 311, in rescale_signal spline = interpolate.splrep(filt_val, filt_exp, s=len(filt_val)) File "/work/software/anaconda3/envs/warpstr/lib/python3.7/site-packages/scipy/interpolate/fitpack.py", line 291, in splrep res = _impl.splrep(x, y, w, xb, xe, k, task, s, t, full_output, per, quiet) File "/work/software/anaconda3/envs/warpstr/lib/python3.7/site-packages/scipy/interpolate/_fitpack_impl.py", line 461, in splrep raise TypeError('m > k must hold') TypeError: m > k must hold """

The above exception was the direct cause of the following exception:

Traceback (most recent call last): File "WarpSTR.py", line 93, in run() File "WarpSTR.py", line 73, in run overview_df, collapsed_df = main_wrapper(locus, main_config.threads) File "/work/software/warpstr/src/caller/wrapper.py", line 23, in main_wrapper results = call_wrapper.run(workload) File "/work/software/warpstr/src/caller/wrapper.py", line 109, in run results = pool.map(warpstr_call_parallel, workload) File "/work/software/anaconda3/envs/warpstr/lib/python3.7/multiprocessing/pool.py", line 268, in map return self._map_async(func, iterable, mapstar, chunksize).get() File "/work/software/anaconda3/envs/warpstr/lib/python3.7/multiprocessing/pool.py", line 657, in get raise self._value TypeError: m > k must hold

xsitarcik commented 10 months ago

Your coordinates look very weird. Are coordinates for the locus correct for the used genome? There can be differences when using hg17 vs hg18 references so please verify. It is crucial to set coordinates precisely and also to use the same reference that was used for mapping.

Further, if there are two motifs you need to decide whether they are part of one locus or it is a different locus. Looking at your coordinates it looks like you have just one locus. So there should be only one locus entry in the configuration file in your case and provide multiple motifs or sequence. Additionally, if you presume that the locus deviates from the reference in motifs, for example, if you presume the presence of insertions of some other motif that do not exist in a reference, you need to manually specify the sequence, not motif. Please see the guide for setting sequence in config, or see below for an example:

If you presume that locus consists of insertions of AAATG repeats followed by AAAAT repeats, it should be like this:

loci:
  - name: SMAD12
    coord: ....
    sequence: (AAATG)(AAAAT)

If there are some other presumptions for example interruption, you could get better results if you provide it in the sequence. For example if you presume an AAAT interruption between your repeating motifs the sequence should be defined as: (AAATG)AAAT(AAAAT). The number of repeats is estimated separately for each motif.

Slowlysun commented 10 months ago

I use hg19 as a reference genome, there are two repeats insert motifs. AAATG is insert in chr8:119379052-119379055, AAAAT is insert in chr8:119379056-119379120, AAATG repeats followed by AAAAT repeats, (AAATG)n(AAAAT)n,so I set config file:

loci:

I get an error too. ################################## ...................... Finishing up any open output files. Basecalling completed successfully. find /NGSDATA/Nanopore/20230706/no_sample/20230706_1132_MN36497_FAT78428_f2a03b54/fastq_pass/20230710_sample11-12/STR_20230809/sample11_STR_0828/SMAD12/fast5/sample11/aux/workspace -maxdepth 1 -name *.fast5 -print0 | xargs -0 cp -t /NGSDATA/Nanopore/20230706/no_sample/20230706_1132_MN36497_FAT78428_f2a03b54/fastq_pass/20230710_sample11-12/STR_20230809/sample11_STR_0828/SMAD12/fast5/sample11/annot; Finished Guppy basecalling for /NGSDATA/Nanopore/20230706/no_sample/20230706_1132_MN36497_FAT78428_f2a03b54/fastq_pass/20230710_sample11-12/STR_20230809/sample11_STR_0828/SMAD12/fast5/sample11 qt.qpa.xcb: X server does not support XInput 2 failed to get the current screen resources DEBUG_3_TREX: Extracting TRs - running with 16 threads Warning: high similarity of state values in reverse pattern ATTTT Results stored in overview file /NGSDATA/Nanopore/20230706/no_sample/20230706_1132_MN36497_FAT78428_f2a03b54/fastq_pass/20230710_sample11-12/STR_20230809/sample11_STR_0828/SMAD12/overview.csv Running complex genotyping as complex repeat units present: ['(AAATG)', '(AAAAT)'] Allele lengths as given by WarpSTR: (65, 585) Allele lengths as given by basecall: (59, 592) Traceback (most recent call last): File "WarpSTR.py", line 93, in run() File "WarpSTR.py", line 79, in run run_genotyping_complex(locus.path, collapsed_df) File "/work/software/warpstr/src/genotyper/genotyping.py", line 156, in run_genotyping_complex plot_complex_repeats(df, cols, alleles, img_path) File "/work/software/warpstr/src/genotyper/plotter.py", line 18, in plot_complex_repeats split=False, scale='count', whis=np.inf, inner=None, ax=axes[idx]) File "/work/software/anaconda3/envs/warpstr/lib/python3.7/site-packages/seaborn/_decorators.py", line 46, in inner_f return f(**kwargs) File "/work/software/anaconda3/envs/warpstr/lib/python3.7/site-packages/seaborn/categorical.py", line 2400, in violinplot color, palette, saturation) File "/work/software/anaconda3/envs/warpstr/lib/python3.7/site-packages/seaborn/categorical.py", line 522, in init self.establish_variables(x, y, hue, data, orient, order, hue_order) File "/work/software/anaconda3/envs/warpstr/lib/python3.7/site-packages/seaborn/categorical.py", line 153, in establish_variables raise ValueError(err) ValueError: Could not interpret input 'allele' #######################################

But I find two file in result path.

  1. predictions/sequences/all.fasta,does this file contain all sequences in the chr8:119379052-119379120?
  2. predictions/complexSTR_analysis/complex_repeat_units.csv, does this file contain the AAATG and AAAAT counts in each sequence between chr8:119379052-119379120?
xsitarcik commented 9 months ago

The mentioned error is in the plotting phase, which is the final phase, so all results should be present even with this error.

For the description of outputs, see our documentation page. To answer your 2 questions:

  1. predictions/sequences/all.fasta contains sequences as predicted by WarpSTR. Similarly there is basecalls/sequences/all.fasta file with sequences as they were basecalled and extracted for the given locus coordinates. Both all.fasta files are further split in other two files based on the strand information, i.e. into reverse and template fasta files.
  2. Yes. The file reports counts of provided repeating motifs per each read that was found to be mapped to the provided genomic coordinates.
Slowlysun commented 9 months ago

There are two motif, (AAATG)(AAAAT), but only a result, which motif is this result? And what is the result of the other motif? Allele lengths as given by WarpSTR: (65, 585) Allele lengths as given by basecall: (59, 592)

xsitarcik commented 9 months ago

These are allele lengths, so it is two numbers, one for each allele - the whole locus encompassing both motifs. Also note that this is a length, so it is a number of bases not number of repeats. So Allele lengths as given by WarpSTR: (65, 585) means that there is one allele with 65 bases and the second allele with 585 bases.

Further, we have released a bugfix for your plotting issue. Please do the git pull and re run. Please note, that in the config you have options like:

single_read_extraction: True   # Extracts reads mapped to the locus and stores them in single .fast5 format
guppy_annotation:       True   # Annotates .fast5 files with mapping between basecalled sequence and the signal
exp_signal_generation:  True   # Generates expected signals for flanks and repeats
tr_region_extraction:   True   # Finds tandem repeat region in read using alignment of basecalled sequence and reference repeat sequence
tr_region_calling:      True   # Uses state automata with DTW alignment to find the number of repeats for each signal
genotyping:             True   # Predicts the final allele lengths from the predicted repeat numbers of each read 

as now you only need to re-run the genotyping phase (where your plotting occurred), you can just set previous steps to False except for genotyping, now only genotyping will be executed as results from previous phases are stored in the output folder.

As you have multiple motifs you should check predictions for complex STRs on path predictions/complexSTR_analysis/. There is complex_repeat_units.csv file with estimated repeats of each motif for each read. And when you rerun after git pull, there should be also complex_alleles.csv file with summarized (clustered) predictions, something like:

unit,allele1_repeats,allele2_repeats
AAATG,5,31
AAAAT,8,8

meaning that the software predicted one allele with 5 AAATG and 8 AAAAT repeats, and the second allele with 31 AAATG and 8 AAAAT repeats. If the second allele has no values, it means that the allele is predicted to be homozygous.

At last, I saw at your log trace the following Warning: high similarity of state values in reverse pattern ATTTT this means that nanopore signals for ATTTT repeats are difficult to distinguish. I would check visualizations (at summaries directory) that are strand specific to see whether the before-mentioned issue influenced the predictions.

Slowlysun commented 9 months ago

Thank you very much, it works well after I run git pull. But I'm wondering why the lengths are 65 and 585, but the repeats are only 1 and 13, and allele 2 is -.

######## Flank length not set for locus - using default value of 110 2023-09-12 08:46:20 Processing 1 of 1 Locus name: SMAD12 Allele lengths as given by WarpSTR: (65, 585) Allele lengths as given by basecall: (59, 592) qt.qpa.xcb: X server does not support XInput 2 failed to get the current screen resources Genotyped complex repeats in a homozygous allele: Unit: AAATG Repeats: 1 - Unit: AAAAT Repeats: 13 - There were 20722 reads for allele1 and - for allele2 Duration for whole: 00h 00m 14s

xsitarcik commented 9 months ago

This can be due to a problem intrinsic to the used clustering algorithm. To cluster predictions and summarize them into one or two allele values, we first try to filter out noisy predictions by simple standard deviation technique and then we use Bayesian Gaussian mixture models with 2 components where we handle 2 cases:

At first WarpSTR clusters list of predicted allele lengths (dimension size = 1) and in your case it determine the locus as heterozygous with allele lengths of 65 and 585, then in case of complex repeats it tries to cluster list of predicted repeats per unit (dimension size = the number of repeat units, in your case we have 2 repeat units) and now it predicts (1 AAATG, 13 AAAAT) and (- , -). This increased dimension poses a much greater ambiguity for clustering. I think that there is a second allele which is greatly expanded but our clustering approach is not optimized for this case. You can play with clustering parameters in the config example/advanced_params.yaml:

# Parameters for genotyping
genotyping_config:
    min_weight: 0.2             # When one mixture model component has lower weight than this, the allele is declared as homozygous
    std_filter: 2               # Filter out predictions which are out of range: mean +- 2*std_filter

As there is one expanded allele, some of those predictions could be filtered before the clustering. So I would also try to play with std_filter, try to increase it to an extreme value like 1000 so there would be no filtering. You have a huge amount of reads, so in your case the clustering should still produce a robust result.

At last, I would like to note that we designed this software mainly for estimating repeat lengths for individual reads. In your case, in predictions/complex_repeat_units.csv you already have the repeat length estimations per individual reads, so I would recommend to work with them, I would start with exploring the repeat frequencies and whether there is some strand differences. In previous comment I also wrote about the warning that occurred in your log trace: Warning: high similarity of state values in reverse pattern ATTTT this means that nanopore signal for repeating ATTTT does not resemble any unique pattern but just a straight line, so I presume that there is much greater variance of repeat length predictions for reverse reads. If you confirm this presumption from exploration of repeat lengths then I would use just predictions from the template strand and cluster those.