Open Slowlysun opened 1 year ago
This looks like some OS misconfiguration. It looks for the hdf5 plugin on /work/software/anaconda3/envs/warpstr/lib/hdf5/plugin
path but it looks like that it does not exist there. Can you check the path if it really exists? Also, what is your operating system? Have you installed WarpSTR using conda?
I installed WarpSTR using conda. /work/software/anaconda3/envs/warpstr/lib/hdf5/plugin path is not exist, even I can't find /work/software/anaconda3/envs/warpstr/lib/hdf5 path
This looks like some OS misconfiguration. It looks for the hdf5 plugin on
/work/software/anaconda3/envs/warpstr/lib/hdf5/plugin
path but it looks like that it does not exist there. Can you check the path if it really exists? Also, what is your operating system? Have you installed WarpSTR using conda?
I would try to activate the installed warpstr conda environment conda activate warpstr
and there install hdf5 again conda install hdf5
. If conda tells you that the requested packages are already installed that OS probably did some interference. Without knowing how to reproduce your problem it will be hard to help. Can you provide more information? What is your OS? Conda version? etc.
DEBUG_2_EXT: found these bamfiles [('test/test_input/test_run1', 'test/test_input/test_run1/mapping/mapping.bam', 'chr4:183178378-183178421', 'test_run1', 'test_run1', 'test/test_output/Human_STR_1108232/fast5')] in test/test_input DEBUG_2_EXT: Running with 2 DEBUG_2_EXT: There are 10 in test/test_input/test_run1 DEBUG_2_EXT: All raw fast5 files have been successfully extracted calling script /work/software/guppy/ont-guppy-cpu/bin/guppy_basecaller --input_path test/test_output/Human_STR_1108232/fast5/test_run1 --save_path test/test_output/Human_STR_1108232/fast5/test_run1/aux --flowcell FLO-MIN106 --kit SQK-LSK109 --fast5_out --cpu_threads_per_caller 2 ONT Guppy basecalling software version 6.0.6+8a98bbc config file: /work/software/guppy/ont-guppy-cpu/data/dna_r9.4.1_450bps_hac.cfg model file: /work/software/guppy/ont-guppy-cpu/data/template_r9.4.1_450bps_hac.jsn input path: test/test_output/Human_STR_1108232/fast5/test_run1 save path: test/test_output/Human_STR_1108232/fast5/test_run1/aux chunk size: 2000 chunks per runner: 256 minimum qscore: 9 records per file: 4000 num basecallers: 1 cpu mode: ON threads per caller: 2 Found 10 fast5 files to process. Init time: 399 ms
0% 10 20 30 40 50 60 70 80 90 100% |----|----|----|----|----|----|----|----|----|----|
Caller time: 137743 ms, Samples called: 1468186, samples/s: 10658.9
Finishing up any open output files.
Basecalling completed successfully.
find test/test_output/Human_STR_1108232/fast5/test_run1/aux/workspace -maxdepth 1 -name *.fast5 -print0 | xargs -0 cp -t test/test_output/Human_STR_1108232/fast5/test_run1/annot;
qt.qpa.xcb: X server does not support XInput 2
failed to get the current screen resources
DEBUG_3_TREX: Extracting TRs - running with 2 threads
Traceback (most recent call last):
File "WarpSTR.py", line 91, in
I just checked and I did not have any problems using mentioned versions. However I used older guppy version, maybe that is the problem. Have you used guppy of that version before? Do you had any problems? I do not have experience with newer guppy versions but maybe the problem is that it produces .fast5 files that are VBZ compressed, and thus requiring decompressing plugin. So I would try to install that plugin.
I see that conda now offers a package for that, so this should be easier option. Just activate the warpstr environment and run: conda install ont_vbz_hdf_plugin -c bioconda
, this should install the plugin at /work/software/anaconda3/envs/warpstr/hdf5/lib/plugin/
and also properly set the env variable HDF5_PLUGIN_PATH to that path. Please verify by echo $HDF5_PLUGIN_PATH
if it was set correctly, or modify it to match the path where plugin was installed.
The other option is to download the plugin from https://github.com/nanoporetech/vbz_compression/releases, install it somewhere and modify the $HDF5_PLUGIN_PATH to point to the directory where you install it. Or you can just create the missing directory /work/software/anaconda3/envs/warpstr/lib/hdf5/plugin
and install the plugin there.
Thank you very much, I can successfully run the scritp after I installed plugin. But when I set ‘motif’ instead of ‘sequence’, there is still an error.
..........................
Finishing up any open output files.
Basecalling completed successfully.
find test/test_output/Human_STR_1108232/fast5/test_run1/aux/workspace -maxdepth 1 -name *.fast5 -print0 | xargs -0 cp -t test/test_output/Human_STR_1108232/fast5/test_run1/annot;
Finished Guppy basecalling for test/test_output/Human_STR_1108232/fast5/test_run1
qt.qpa.xcb: X server does not support XInput 2
failed to get the current screen resources
DEBUG_3_TREX: Extracting TRs - running with 16 threads
Traceback (most recent call last):
File "WarpSTR.py", line 91, in
Looks like that you specified lowercase characters in motif, there must be only uppercase characters (at least in this version)
I used the config file of the test(config_test.yaml), but still reported an error.
loci:
When a motif is provided instead of sequence, WarpSTR produces the sequence automatically from the reference file using locus coordinates. In your case reference file contained lowercase characters somewhere in that locus region. We did not test such a case so we missed this. Thank you. We released a bugfix and now it should be working with lowercase letters found in the reference fasta. So please run git pull in the cloned repository and run the test again.
After run git pull, I run python WarpSTR.py test/config_test.yaml, but I get another error:
................................. Finished Guppy basecalling for test/test_output/Human_STR_1108232/fast5/test_run1 qt.qpa.xcb: X server does not support XInput 2 failed to get the current screen resources DEBUG_3_TREX: Extracting TRs - running with 16 threads multiprocessing.pool.RemoteTraceback: """ Traceback (most recent call last): File "/work/software/anaconda3/envs/warpstr/lib/python3.7/multiprocessing/pool.py", line 121, in worker result = (True, func(*args, *kwds)) File "/work/software/anaconda3/envs/warpstr/lib/python3.7/multiprocessing/pool.py", line 44, in mapstar return list(map(args)) File "/work/software/warpstr/src/caller/wrapper.py", line 289, in warpstr_call_parallel return warpstr.run(input_data.signal) File "/work/software/warpstr/src/caller/caller.py", line 133, in run rescaled_signal2 = rescale_signal(rescaled_signal, resc_alignment) File "/work/software/warpstr/src/caller/caller.py", line 311, in rescale_signal spline = interpolate.splrep(filt_val, filt_exp, s=len(filt_val)) File "/work/software/anaconda3/envs/warpstr/lib/python3.7/site-packages/scipy/interpolate/fitpack.py", line 291, in splrep res = _impl.splrep(x, y, w, xb, xe, k, task, s, t, full_output, per, quiet) File "/work/software/anaconda3/envs/warpstr/lib/python3.7/site-packages/scipy/interpolate/_fitpack_impl.py", line 461, in splrep raise TypeError('m > k must hold') TypeError: m > k must hold """
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "WarpSTR.py", line 93, in
Was this error produced when you set the motif
in the config? Further, when you used the sequence
in the configuration file have you run into this error?
If you used motif
in the config, you can check the automatically produced sequence. There is a sequence.txt
file in the test_output directory, the relative path from the cloned repo should be test/test_output/Human_STR_1108232/sequence.txt
. Please verify that the automatically produced sequence is correct, it should be (AAAT)
.
When I used the sequence in the configuration file, it works fine. But when I used the motif, I got that error. When I used motif in the config, I can find test/test_output/Human_STR_1108232/sequence.txt.
$ cat test/test_output/Human_STR_1108232/sequence.txt AAATAAATAAATAAATAAATAAATAAATAAATAAATAAATAAAT
Well that is the problem, it should be (AAAT)
, so it means WarpSTR generated sequence incorrectly from the motif and reference locus. We released another bugfix addressing the issue when the reference fasta file is of a mix of upper and lowercase characters. Please run git pull again and rerun the test.
I would also thank you for your patience and for providing feedback. Really appreciated.
I run git pull and return the test, it works well. Then I run my own data, there are AAATG and AAATA two motifs, AAATA can get result, and the file of sequence.txt is (AAATA)AAAT, but AAATG still reports the same error, and the file of sequence.txt is AAATAAAATAAAATAAAATAAAATAAAATAAAATAAAATAAAATAAAATAAAATAAAATAAAATAAAAT. Can this software distinguish between two motifs that are similar?
These two motifs that you mention are at one locus? or it is two loci, each with its own motif? Please provide the loci configuration so I can verify if it is ok.
I set the same coord (chr8:119379052-119379120), but the motif of AAATA has results, AAATG reports an error.
####### the file of config.yaml loci:
In fact, the two motifs should be in different regions, AAATG is located at chr8:119379052-119379055 and AAAAT is located at chr8:119379056-119379120, Then I modify the config.yaml file. I get an error. As you said earlier, the file sequence.txt is not the same as the set motif.
$ cat sequence.txt AAAT
######## the file of config.yaml loci:
######### error infomation failed to get the current screen resources DEBUG_3_TREX: Extracting TRs - running with 16 threads multiprocessing.pool.RemoteTraceback: """ Traceback (most recent call last): File "/work/software/anaconda3/envs/warpstr/lib/python3.7/multiprocessing/pool.py", line 121, in worker result = (True, func(*args, *kwds)) File "/work/software/anaconda3/envs/warpstr/lib/python3.7/multiprocessing/pool.py", line 44, in mapstar return list(map(args)) File "/work/software/warpstr/src/caller/wrapper.py", line 289, in warpstr_call_parallel return warpstr.run(input_data.signal) File "/work/software/warpstr/src/caller/caller.py", line 133, in run rescaled_signal2 = rescale_signal(rescaled_signal, resc_alignment) File "/work/software/warpstr/src/caller/caller.py", line 311, in rescale_signal spline = interpolate.splrep(filt_val, filt_exp, s=len(filt_val)) File "/work/software/anaconda3/envs/warpstr/lib/python3.7/site-packages/scipy/interpolate/fitpack.py", line 291, in splrep res = _impl.splrep(x, y, w, xb, xe, k, task, s, t, full_output, per, quiet) File "/work/software/anaconda3/envs/warpstr/lib/python3.7/site-packages/scipy/interpolate/_fitpack_impl.py", line 461, in splrep raise TypeError('m > k must hold') TypeError: m > k must hold """
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "WarpSTR.py", line 93, in
Your coordinates look very weird. Are coordinates for the locus correct for the used genome? There can be differences when using hg17 vs hg18 references so please verify. It is crucial to set coordinates precisely and also to use the same reference that was used for mapping.
Further, if there are two motifs you need to decide whether they are part of one locus or it is a different locus. Looking at your coordinates it looks like you have just one locus. So there should be only one locus entry in the configuration file in your case and provide multiple motifs or sequence.
Additionally, if you presume that the locus deviates from the reference in motifs, for example, if you presume the presence of insertions of some other motif that do not exist in a reference, you need to manually specify the sequence
, not motif
. Please see the guide for setting sequence
in config, or see below for an example:
If you presume that locus consists of insertions of AAATG repeats followed by AAAAT repeats, it should be like this:
loci:
- name: SMAD12
coord: ....
sequence: (AAATG)(AAAAT)
If there are some other presumptions for example interruption, you could get better results if you provide it in the sequence. For example if you presume an AAAT interruption between your repeating motifs the sequence should be defined as: (AAATG)AAAT(AAAAT)
. The number of repeats is estimated separately for each motif.
I use hg19 as a reference genome, there are two repeats insert motifs. AAATG is insert in chr8:119379052-119379055, AAAAT is insert in chr8:119379056-119379120, AAATG repeats followed by AAAAT repeats, (AAATG)n(AAAAT)n,so I set config file:
loci:
I get an error too.
##################################
......................
Finishing up any open output files.
Basecalling completed successfully.
find /NGSDATA/Nanopore/20230706/no_sample/20230706_1132_MN36497_FAT78428_f2a03b54/fastq_pass/20230710_sample11-12/STR_20230809/sample11_STR_0828/SMAD12/fast5/sample11/aux/workspace -maxdepth 1 -name *.fast5 -print0 | xargs -0 cp -t /NGSDATA/Nanopore/20230706/no_sample/20230706_1132_MN36497_FAT78428_f2a03b54/fastq_pass/20230710_sample11-12/STR_20230809/sample11_STR_0828/SMAD12/fast5/sample11/annot;
Finished Guppy basecalling for /NGSDATA/Nanopore/20230706/no_sample/20230706_1132_MN36497_FAT78428_f2a03b54/fastq_pass/20230710_sample11-12/STR_20230809/sample11_STR_0828/SMAD12/fast5/sample11
qt.qpa.xcb: X server does not support XInput 2
failed to get the current screen resources
DEBUG_3_TREX: Extracting TRs - running with 16 threads
Warning: high similarity of state values in reverse pattern ATTTT
Results stored in overview file /NGSDATA/Nanopore/20230706/no_sample/20230706_1132_MN36497_FAT78428_f2a03b54/fastq_pass/20230710_sample11-12/STR_20230809/sample11_STR_0828/SMAD12/overview.csv
Running complex genotyping as complex repeat units present: ['(AAATG)', '(AAAAT)']
Allele lengths as given by WarpSTR: (65, 585)
Allele lengths as given by basecall: (59, 592)
Traceback (most recent call last):
File "WarpSTR.py", line 93, in
But I find two file in result path.
The mentioned error is in the plotting phase, which is the final phase, so all results should be present even with this error.
For the description of outputs, see our documentation page. To answer your 2 questions:
predictions/sequences/all.fasta
contains sequences as predicted by WarpSTR. Similarly there is basecalls/sequences/all.fasta
file with sequences as they were basecalled and extracted for the given locus coordinates. Both all.fasta
files are further split in other two files based on the strand information, i.e. into reverse and template fasta files.There are two motif, (AAATG)(AAAAT), but only a result, which motif is this result? And what is the result of the other motif? Allele lengths as given by WarpSTR: (65, 585) Allele lengths as given by basecall: (59, 592)
These are allele lengths, so it is two numbers, one for each allele - the whole locus encompassing both motifs. Also note that this is a length, so it is a number of bases not number of repeats. So Allele lengths as given by WarpSTR: (65, 585)
means that there is one allele with 65 bases and the second allele with 585 bases.
Further, we have released a bugfix for your plotting issue. Please do the git pull and re run. Please note, that in the config you have options like:
single_read_extraction: True # Extracts reads mapped to the locus and stores them in single .fast5 format
guppy_annotation: True # Annotates .fast5 files with mapping between basecalled sequence and the signal
exp_signal_generation: True # Generates expected signals for flanks and repeats
tr_region_extraction: True # Finds tandem repeat region in read using alignment of basecalled sequence and reference repeat sequence
tr_region_calling: True # Uses state automata with DTW alignment to find the number of repeats for each signal
genotyping: True # Predicts the final allele lengths from the predicted repeat numbers of each read
as now you only need to re-run the genotyping phase (where your plotting occurred), you can just set previous steps to False
except for genotyping
, now only genotyping will be executed as results from previous phases are stored in the output folder.
As you have multiple motifs you should check predictions for complex STRs on path predictions/complexSTR_analysis/
. There is complex_repeat_units.csv
file with estimated repeats of each motif for each read. And when you rerun after git pull, there should be also complex_alleles.csv
file with summarized (clustered) predictions, something like:
unit,allele1_repeats,allele2_repeats
AAATG,5,31
AAAAT,8,8
meaning that the software predicted one allele with 5 AAATG and 8 AAAAT repeats, and the second allele with 31 AAATG and 8 AAAAT repeats. If the second allele has no values, it means that the allele is predicted to be homozygous.
At last, I saw at your log trace the following Warning: high similarity of state values in reverse pattern ATTTT
this means that nanopore signals for ATTTT repeats are difficult to distinguish. I would check visualizations (at summaries
directory) that are strand specific to see whether the before-mentioned issue influenced the predictions.
Thank you very much, it works well after I run git pull. But I'm wondering why the lengths are 65 and 585, but the repeats are only 1 and 13, and allele 2 is -.
######## Flank length not set for locus - using default value of 110 2023-09-12 08:46:20 Processing 1 of 1 Locus name: SMAD12 Allele lengths as given by WarpSTR: (65, 585) Allele lengths as given by basecall: (59, 592) qt.qpa.xcb: X server does not support XInput 2 failed to get the current screen resources Genotyped complex repeats in a homozygous allele: Unit: AAATG Repeats: 1 - Unit: AAAAT Repeats: 13 - There were 20722 reads for allele1 and - for allele2 Duration for whole: 00h 00m 14s
This can be due to a problem intrinsic to the used clustering algorithm. To cluster predictions and summarize them into one or two allele values, we first try to filter out noisy predictions by simple standard deviation technique and then we use Bayesian Gaussian mixture models with 2 components where we handle 2 cases:
At first WarpSTR clusters list of predicted allele lengths (dimension size = 1) and in your case it determine the locus as heterozygous with allele lengths of 65 and 585, then in case of complex repeats it tries to cluster list of predicted repeats per unit (dimension size = the number of repeat units, in your case we have 2 repeat units) and now it predicts (1 AAATG, 13 AAAAT) and (- , -). This increased dimension poses a much greater ambiguity for clustering. I think that there is a second allele which is greatly expanded but our clustering approach is not optimized for this case. You can play with clustering parameters in the config example/advanced_params.yaml
:
# Parameters for genotyping
genotyping_config:
min_weight: 0.2 # When one mixture model component has lower weight than this, the allele is declared as homozygous
std_filter: 2 # Filter out predictions which are out of range: mean +- 2*std_filter
As there is one expanded allele, some of those predictions could be filtered before the clustering. So I would also try to play with std_filter
, try to increase it to an extreme value like 1000 so there would be no filtering. You have a huge amount of reads, so in your case the clustering should still produce a robust result.
At last, I would like to note that we designed this software mainly for estimating repeat lengths for individual reads. In your case, in predictions/complex_repeat_units.csv
you already have the repeat length estimations per individual reads, so I would recommend to work with them, I would start with exploring the repeat frequencies and whether there is some strand differences. In previous comment I also wrote about the warning that occurred in your log trace: Warning: high similarity of state values in reverse pattern ATTTT
this means that nanopore signal for repeating ATTTT does not resemble any unique pattern but just a straight line, so I presume that there is much greater variance of repeat length predictions for reverse reads. If you confirm this presumption from exploration of repeat lengths then I would use just predictions from the template strand and cluster those.
When I check whether WarpSTR works correctly by running: bash run_test_case.sh , I get this traceback:
ONT Guppy basecalling software version 6.0.6+8a98bbc config file: /work/software/guppy/ont-guppy-cpu/data/dna_r9.4.1_450bps_hac.cfg model file: /work/software/guppy/ont-guppy-cpu/data/template_r9.4.1_450bps_hac.jsn input path: test/test_output/Human_STR_1108232/fast5/test_run1 save path: test/test_output/Human_STR_1108232/fast5/test_run1/aux chunk size: 2000 chunks per runner: 256 minimum qscore: 9 records per file: 4000 num basecallers: 1 cpu mode: ON threads per caller: 2 Found 10 fast5 files to process. Init time: 333 ms
0% 10 20 30 40 50 60 70 80 90 100% |----|----|----|----|----|----|----|----|----|----|
Caller time: 172533 ms, Samples called: 1468186, samples/s: 8509.59 Finishing up any open output files. Basecalling completed successfully. find test/test_output/Human_STR_1108232/fast5/test_run1/aux/workspace -maxdepth 1 -name *.fast5 -print0 | xargs -0 cp -t test/test_output/Human_STR_1108232/fast5/test_run1/annot; qt.qpa.xcb: X server does not support XInput 2 failed to get the current screen resources DEBUG_3_TREX: Extracting TRs - running with 2 threads Traceback (most recent call last): File "WarpSTR.py", line 91, in
run()
File "WarpSTR.py", line 71, in run
overview_df, collapsed_df = main_wrapper(locus, main_config.threads)
File "/work/software/warpstr/src/caller/wrapper.py", line 20, in main_wrapper
workload = get_workload(df_overview, locus.path)
File "/work/software/warpstr/src/caller/wrapper.py", line 52, in get_workload
norm_signal = fast5.get_data_processed((row.l_start_raw, row.r_end_raw))
File "/work/software/warpstr/src/schemas/fast5.py", line 52, in get_data_processed
self.handle['Raw']['Reads'][rname]['Signal'])
File "/work/software/anaconda3/envs/warpstr/lib/python3.7/site-packages/numpy/core/_asarray.py", line 102, in asarray
return array(a, dtype, copy=False, order=order)
File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
File "/work/software/anaconda3/envs/warpstr/lib/python3.7/site-packages/h5py/_hl/dataset.py", line 1005, in array
self.read_direct(arr)
File "/work/software/anaconda3/envs/warpstr/lib/python3.7/site-packages/h5py/_hl/dataset.py", line 966, in read_direct
self.id.read(mspace, fspace, dest, dxpl=self._dxpl)
File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
File "h5py/h5d.pyx", line 192, in h5py.h5d.DatasetID.read
File "h5py/_proxy.pyx", line 112, in h5py._proxy.dset_rw
OSError: Can't read data (can't open directory: /work/software/anaconda3/envs/warpstr/lib/hdf5/plugin)
This looks like the software is missing files.