General questions: mode and usage

Hi Kinsey,

Thank you for your quick response to my issue; everything was fixed. I tried HMMSTR on several of my own datasets, and the experience was amazing.

However, I have several questions:

There are two input modes. Could you confirm whether they return the same results or if there are any differences between them?
Regarding the definition of our own input files, is the information based on a specific reference genome or is it possible to extend this to other species. I am currently using hg38.

Is HMMSTR applicable to other types of repeats, and are there any length limits? I attempted to detect a 67-bp repeat in my data but encountered an error as follows:

hmmstr targets_tsv /scratch/kinfai_root/kinfai0/hsinlun/reference/HMMSTR.bed ./test/scratch/kinfai_root/kinfai0/hsinlun/Data_R10/test/test.fasta --cpus 8
Parameters for this HMMSTR run:
subcommand : targets_tsv
out : ./test
inFile : /scratch/kinfai_root/kinfai0/hsinlun/Data_R10/test/test.fasta
output_plots : False
output_labelled_seqs : False
max_peaks : 2
cpus : 8
flanking_size : 100
mode : map-ont
mapq_cutoff : 30
k : None
w : None
use_full_read : False
peakcalling_method : auto
bandwidth : None
kernel : gaussian
bootstrap : False
call_width : 0.95
resample_size : 100
allele_specific_CIs : False
allele_specific_plots : False
discard_outliers : False
filter_quantile : 0.25
flanking_like_filter : False
stranded_report : False
cluster_only : False
save_intermediates : False
background : None
E_probs : None
A_probs : None
custom_RM : None
hmm_pre : None
targets : /scratch/kinfai_root/kinfai0/hsinlun/reference/HMMSTR.bed 
Target tsv input selected! Reading input from  /scratch/kinfai_root/kinfai0/hsinlun/reference/HMMSTR.bed ...
All models built! finished .... time was:  79.32333261705935
Starting multiprocess, cpu's in use: 8
All reads processed, the pooled run took:  1037.1308538229205
Traceback (most recent call last):
File "/nfs/turbo/umms-kinfai/hsinlun/venv/HMMSTR/bin/hmmstr", line 8, in <module>
sys.exit(main())
File "/nfs/turbo/umms-kinfai/hsinlun/venv/HMMSTR/lib/python3.10/site-packages/HMMSTR/HMMSTR.py", line 773, in main
geno_df_final = geno_df.apply(sort_outputs, args=(args.max_peaks,args.bootstrap,args.allele_specific_CIs), axis=1)
File "/nfs/turbo/umms-kinfai/hsinlun/venv/HMMSTR/lib/python3.10/site-packages/pandas/core/frame.py", line 10374, in apply
return op.apply().__finalize__(self, method="apply")
File "/nfs/turbo/umms-kinfai/hsinlun/venv/HMMSTR/lib/python3.10/site-packages/pandas/core/apply.py", line 916, in apply
return self.apply_standard()
File "/nfs/turbo/umms-kinfai/hsinlun/venv/HMMSTR/lib/python3.10/site-packages/pandas/core/apply.py", line 1063, in apply_standard
results, res_index = self.apply_series_generator()
File "/nfs/turbo/umms-kinfai/hsinlun/venv/HMMSTR/lib/python3.10/site-packages/pandas/core/apply.py", line 1081, in apply_series_generator
results[i] = self.func(v, *self.args, **self.kwargs)
File "/nfs/turbo/umms-kinfai/hsinlun/venv/HMMSTR/lib/python3.10/site-packages/HMMSTR_utils/HMMSTR_utils.py", line 246, in sort_outputs
new_row = curr_row[column_names_in_order].copy()
File "/nfs/turbo/umms-kinfai/hsinlun/venv/HMMSTR/lib/python3.10/site-packages/pandas/core/series.py", line 1153, in __getitem__
return self._get_with(key)
File "/nfs/turbo/umms-kinfai/hsinlun/venv/HMMSTR/lib/python3.10/site-packages/pandas/core/series.py", line 1194, in _get_with
return self.loc[key]
File "/nfs/turbo/umms-kinfai/hsinlun/venv/HMMSTR/lib/python3.10/site-packages/pandas/core/indexing.py", line 1191, in __getitem__
return self._getitem_axis(maybe_callable, axis=axis)
File "/nfs/turbo/umms-kinfai/hsinlun/venv/HMMSTR/lib/python3.10/site-packages/pandas/core/indexing.py", line 1420, in _getitem_axis
return self._getitem_iterable(key, axis=axis)
File "/nfs/turbo/umms-kinfai/hsinlun/venv/HMMSTR/lib/python3.10/site-packages/pandas/core/indexing.py", line 1360, in _getitem_iterable
keyarr, indexer = self._get_listlike_indexer(key, axis)
File "/nfs/turbo/umms-kinfai/hsinlun/venv/HMMSTR/lib/python3.10/site-packages/pandas/core/indexing.py", line 1558, in _get_listlike_indexer
keyarr, indexer = ax._get_indexer_strict(key, axis_name)
File "/nfs/turbo/umms-kinfai/hsinlun/venv/HMMSTR/lib/python3.10/site-packages/pandas/core/indexes/base.py", line 6200, in _get_indexer_strict
self._raise_if_missing(keyarr, indexer, axis_name)
File "/nfs/turbo/umms-kinfai/hsinlun/venv/HMMSTR/lib/python3.10/site-packages/pandas/core/indexes/base.py", line 6252, in _raise_if_missing
raise KeyError(f"{not_found} not in index")
KeyError: "['num_supporting_reads'] not in index"

Regarding the output, genotype_calls.tsv shows:

name    A1:median   A1:mode A1:SD   A1:supporting_reads A2:median   A2:mode A2:SD   A2:supporting_reads num_supporting_reads    bandwidth   peak_calling_method
C9ORF72 0   0   0   0   264.0   264 0.7339693949098695  68  68  0.5 kde_throw_outliers

I am wondering why the copy number of allele 1 is 0 while that of allele 2 is 264.0, rather than having only allele 1 with a copy number of 264.0.

read_assignments.tsv shows:

name    read_id strand  align_score neg_log_likelihood  subset_likelihood   repeat_likelihood   repeat_start    repeat_end  align_start align_end   counts  freq    cluster_assignmentsoutlier  peak_calling_method
C9ORF72 C9ORF72-8_27568299_aligned_278233_F_39_15559_10 forward 120 -1546.114   -308.153    -296.803    5254    6833    5058    7032    264.0   37  1   False   kde_throw_outliers
C9ORF72 C9ORF72-8_27560224_aligned_280856_F_1_38138_44  forward 120 -1726.833   -484.1819999999999  -466.743    13276   14853   13075   15051   261.0   2       True    kde_throw_outliers
C9ORF72 C9ORF72-8_27559176_aligned_334855_F_1_32063_41  forward 120 -1741.146   -499.6199999999999  -476.6540000000001  14314   15905   14114   16112   264.0   37  1   False   kde_throw_outliers
C9ORF72 C9ORF72-8_27539687_aligned_480453_R_390_40843_20094 reverse 120 -1562.965   -316.68600000000004 -299.563    25510   27089   25311   27288   264.0   37  1   False   kde_throw_outliers
C9ORF72 C9ORF72-8_27572376_aligned_18870_R_2_15374_0    reverse 120 -1595.043   -359.16100000000006 -331.653    12651   14227   12448   14426   263.0   21  1   False   kde_throw_outliers
C9ORF72 C9ORF72-8_27552425_aligned_44556_F_1216_30940_528   forward 120 -1562.619   -316.34000000000003 -304.352    22331   23916   22131   24115   264.0   37  1   False   kde_throw_outliers
C9ORF72 C9ORF72-8_27558694_aligned_147495_R_41_18456_9  reverse 120 -1631.929   -387.199    -343.42 2046    3625    1845    3824    263.0   21  1   False   kde_throw_outliers
C9ORF72 C9ORF72-8_27518711_aligned_305149_R_43_57261_0  reverse 120 -1647.123   -411.2410000000001  -368.809    860 2438    661 2632    263.0   21  1   False   kde_throw_outliers
C9ORF72 C9ORF72-8_27553579_aligned_420791_R_43_32918_1  reverse 120 -1597.8 -353.6  -326.093    11359   12942   11160   13142   264.0   37  1   False   kde_throw_outliers
C9ORF72 C9ORF72-8_27572786_aligned_505298_F_1_10977_44  forward 120 -1621.927   -375.648    -358.381    744 2327    544 2525    264.0   37  1   False   kde_throw_outliers
C9ORF72 C9ORF72-8_27571470_aligned_514863_R_8_12938_30  reverse 120 -1525.204   -279.501    -262.02599999999995 9322    10906   9123    11105   264.0   37  1   False   kde_throw_outliers
C9ORF72 C9ORF72-8_27565841_aligned_527646_R_33_16083_12 reverse 120 -1636.656   -400.775    -318.755    6816    8394    6618    8598    262.0   6   1   False   kde_throw_outliers

However, all the reads were assigned to allele one. Additionally, I noticed that some reads have a blank in the cluster_assignments field. What does this mean, and how should I interpret these reads?

C9ORF72 C9ORF72-8_27560224_aligned_280856_F_1_38138_44  forward 120 -1726.833   -484.1819999999999  -466.743    13276   14853   13075   15051   261.0   2       True    kde_throw_outliers

Could you also provide a brief explanation of the "freq" column (the frequency of the copy number for the assigned target)? I supposed that this value should be less than 1, but in my results, the values are greater than 1.

Any suggestions and comments would be appreciated.

Best regards, Hsin

Boyle-Lab / HMMSTR

General questions: mode and usage #2