Boyle-Lab / HMMSTR

MIT License
4 stars 1 forks source link

General questions: mode and usage #2

Open HLHsieh opened 3 months ago

HLHsieh commented 3 months ago

Hi Kinsey,

Thank you for your quick response to my issue; everything was fixed. I tried HMMSTR on several of my own datasets, and the experience was amazing.

However, I have several questions:

  1. There are two input modes. Could you confirm whether they return the same results or if there are any differences between them?
  2. Regarding the definition of our own input files, is the information based on a specific reference genome or is it possible to extend this to other species. I am currently using hg38.
  3. Is HMMSTR applicable to other types of repeats, and are there any length limits? I attempted to detect a 67-bp repeat in my data but encountered an error as follows:
    hmmstr targets_tsv /scratch/kinfai_root/kinfai0/hsinlun/reference/HMMSTR.bed ./test/scratch/kinfai_root/kinfai0/hsinlun/Data_R10/test/test.fasta --cpus 8
    Parameters for this HMMSTR run:
    subcommand : targets_tsv
    out : ./test
    inFile : /scratch/kinfai_root/kinfai0/hsinlun/Data_R10/test/test.fasta
    output_plots : False
    output_labelled_seqs : False
    max_peaks : 2
    cpus : 8
    flanking_size : 100
    mode : map-ont
    mapq_cutoff : 30
    k : None
    w : None
    use_full_read : False
    peakcalling_method : auto
    bandwidth : None
    kernel : gaussian
    bootstrap : False
    call_width : 0.95
    resample_size : 100
    allele_specific_CIs : False
    allele_specific_plots : False
    discard_outliers : False
    filter_quantile : 0.25
    flanking_like_filter : False
    stranded_report : False
    cluster_only : False
    save_intermediates : False
    background : None
    E_probs : None
    A_probs : None
    custom_RM : None
    hmm_pre : None
    targets : /scratch/kinfai_root/kinfai0/hsinlun/reference/HMMSTR.bed 
    Target tsv input selected! Reading input from  /scratch/kinfai_root/kinfai0/hsinlun/reference/HMMSTR.bed ...
    All models built! finished .... time was:  79.32333261705935
    Starting multiprocess, cpu's in use: 8
    All reads processed, the pooled run took:  1037.1308538229205
    Traceback (most recent call last):
    File "/nfs/turbo/umms-kinfai/hsinlun/venv/HMMSTR/bin/hmmstr", line 8, in <module>
    sys.exit(main())
    File "/nfs/turbo/umms-kinfai/hsinlun/venv/HMMSTR/lib/python3.10/site-packages/HMMSTR/HMMSTR.py", line 773, in main
    geno_df_final = geno_df.apply(sort_outputs, args=(args.max_peaks,args.bootstrap,args.allele_specific_CIs), axis=1)
    File "/nfs/turbo/umms-kinfai/hsinlun/venv/HMMSTR/lib/python3.10/site-packages/pandas/core/frame.py", line 10374, in apply
    return op.apply().__finalize__(self, method="apply")
    File "/nfs/turbo/umms-kinfai/hsinlun/venv/HMMSTR/lib/python3.10/site-packages/pandas/core/apply.py", line 916, in apply
    return self.apply_standard()
    File "/nfs/turbo/umms-kinfai/hsinlun/venv/HMMSTR/lib/python3.10/site-packages/pandas/core/apply.py", line 1063, in apply_standard
    results, res_index = self.apply_series_generator()
    File "/nfs/turbo/umms-kinfai/hsinlun/venv/HMMSTR/lib/python3.10/site-packages/pandas/core/apply.py", line 1081, in apply_series_generator
    results[i] = self.func(v, *self.args, **self.kwargs)
    File "/nfs/turbo/umms-kinfai/hsinlun/venv/HMMSTR/lib/python3.10/site-packages/HMMSTR_utils/HMMSTR_utils.py", line 246, in sort_outputs
    new_row = curr_row[column_names_in_order].copy()
    File "/nfs/turbo/umms-kinfai/hsinlun/venv/HMMSTR/lib/python3.10/site-packages/pandas/core/series.py", line 1153, in __getitem__
    return self._get_with(key)
    File "/nfs/turbo/umms-kinfai/hsinlun/venv/HMMSTR/lib/python3.10/site-packages/pandas/core/series.py", line 1194, in _get_with
    return self.loc[key]
    File "/nfs/turbo/umms-kinfai/hsinlun/venv/HMMSTR/lib/python3.10/site-packages/pandas/core/indexing.py", line 1191, in __getitem__
    return self._getitem_axis(maybe_callable, axis=axis)
    File "/nfs/turbo/umms-kinfai/hsinlun/venv/HMMSTR/lib/python3.10/site-packages/pandas/core/indexing.py", line 1420, in _getitem_axis
    return self._getitem_iterable(key, axis=axis)
    File "/nfs/turbo/umms-kinfai/hsinlun/venv/HMMSTR/lib/python3.10/site-packages/pandas/core/indexing.py", line 1360, in _getitem_iterable
    keyarr, indexer = self._get_listlike_indexer(key, axis)
    File "/nfs/turbo/umms-kinfai/hsinlun/venv/HMMSTR/lib/python3.10/site-packages/pandas/core/indexing.py", line 1558, in _get_listlike_indexer
    keyarr, indexer = ax._get_indexer_strict(key, axis_name)
    File "/nfs/turbo/umms-kinfai/hsinlun/venv/HMMSTR/lib/python3.10/site-packages/pandas/core/indexes/base.py", line 6200, in _get_indexer_strict
    self._raise_if_missing(keyarr, indexer, axis_name)
    File "/nfs/turbo/umms-kinfai/hsinlun/venv/HMMSTR/lib/python3.10/site-packages/pandas/core/indexes/base.py", line 6252, in _raise_if_missing
    raise KeyError(f"{not_found} not in index")
    KeyError: "['num_supporting_reads'] not in index"
  4. Regarding the output, genotype_calls.tsv shows:
    name    A1:median   A1:mode A1:SD   A1:supporting_reads A2:median   A2:mode A2:SD   A2:supporting_reads num_supporting_reads    bandwidth   peak_calling_method
    C9ORF72 0   0   0   0   264.0   264 0.7339693949098695  68  68  0.5 kde_throw_outliers

    I am wondering why the copy number of allele 1 is 0 while that of allele 2 is 264.0, rather than having only allele 1 with a copy number of 264.0.

read_assignments.tsv shows:

name    read_id strand  align_score neg_log_likelihood  subset_likelihood   repeat_likelihood   repeat_start    repeat_end  align_start align_end   counts  freq    cluster_assignmentsoutlier  peak_calling_method
C9ORF72 C9ORF72-8_27568299_aligned_278233_F_39_15559_10 forward 120 -1546.114   -308.153    -296.803    5254    6833    5058    7032    264.0   37  1   False   kde_throw_outliers
C9ORF72 C9ORF72-8_27560224_aligned_280856_F_1_38138_44  forward 120 -1726.833   -484.1819999999999  -466.743    13276   14853   13075   15051   261.0   2       True    kde_throw_outliers
C9ORF72 C9ORF72-8_27559176_aligned_334855_F_1_32063_41  forward 120 -1741.146   -499.6199999999999  -476.6540000000001  14314   15905   14114   16112   264.0   37  1   False   kde_throw_outliers
C9ORF72 C9ORF72-8_27539687_aligned_480453_R_390_40843_20094 reverse 120 -1562.965   -316.68600000000004 -299.563    25510   27089   25311   27288   264.0   37  1   False   kde_throw_outliers
C9ORF72 C9ORF72-8_27572376_aligned_18870_R_2_15374_0    reverse 120 -1595.043   -359.16100000000006 -331.653    12651   14227   12448   14426   263.0   21  1   False   kde_throw_outliers
C9ORF72 C9ORF72-8_27552425_aligned_44556_F_1216_30940_528   forward 120 -1562.619   -316.34000000000003 -304.352    22331   23916   22131   24115   264.0   37  1   False   kde_throw_outliers
C9ORF72 C9ORF72-8_27558694_aligned_147495_R_41_18456_9  reverse 120 -1631.929   -387.199    -343.42 2046    3625    1845    3824    263.0   21  1   False   kde_throw_outliers
C9ORF72 C9ORF72-8_27518711_aligned_305149_R_43_57261_0  reverse 120 -1647.123   -411.2410000000001  -368.809    860 2438    661 2632    263.0   21  1   False   kde_throw_outliers
C9ORF72 C9ORF72-8_27553579_aligned_420791_R_43_32918_1  reverse 120 -1597.8 -353.6  -326.093    11359   12942   11160   13142   264.0   37  1   False   kde_throw_outliers
C9ORF72 C9ORF72-8_27572786_aligned_505298_F_1_10977_44  forward 120 -1621.927   -375.648    -358.381    744 2327    544 2525    264.0   37  1   False   kde_throw_outliers
C9ORF72 C9ORF72-8_27571470_aligned_514863_R_8_12938_30  reverse 120 -1525.204   -279.501    -262.02599999999995 9322    10906   9123    11105   264.0   37  1   False   kde_throw_outliers
C9ORF72 C9ORF72-8_27565841_aligned_527646_R_33_16083_12 reverse 120 -1636.656   -400.775    -318.755    6816    8394    6618    8598    262.0   6   1   False   kde_throw_outliers

However, all the reads were assigned to allele one. Additionally, I noticed that some reads have a blank in the cluster_assignments field. What does this mean, and how should I interpret these reads?

C9ORF72 C9ORF72-8_27560224_aligned_280856_F_1_38138_44  forward 120 -1726.833   -484.1819999999999  -466.743    13276   14853   13075   15051   261.0   2       True    kde_throw_outliers

Could you also provide a brief explanation of the "freq" column (the frequency of the copy number for the assigned target)? I supposed that this value should be less than 1, but in my results, the values are greater than 1.

Any suggestions and comments would be appreciated.

Best regards, Hsin

kvandeynze commented 1 month ago

Hi Hsin, Thank you for your interest in HMMSTR! Here are my responses for your quesitons: 1) The only file difference in outputs between the coordinates and targets_tsv modes is that under coordinates mode the *_inputs.tsv file will be produced from the input coordinates. All other outputs should be equivalent 2) The input tsv can be any target sequence based on any species you would like! The benefit of this mode is that it is 100% reference free so hg38 or any other sequence you would like to target should be compatible in this mode. 3) HMMSTR should be compatible with any tandemly repeated DNA element, however if the sequence diverges too far from the expected repeat unit HMMSTR may return that there were no reads found with that underlying repeat sequence. As for your error, this appears to be a bug where I failed to initialize the 'num_supporting_reads' column for the first target returned with no supporting reads detected. In response to this I have patched where I believe the problem was introduced but please let me know if you are still encountering this issue. 4) This was an inconsistency between the allele sorting in the genotypes vs read_assignment files, I have now updated the outputs such that the allele assignment number will be the same across both output files. To note, the alleles are sorted by size so all homozygous calls will have A1 equal to 0. As for the rows in read_assignments with no cluster assignment: these reads were called as outliers and were thus dropped before the clustering step so they were not assigned to a cluster. If you would like all reads to be clustered, the 'auto' peakcalling_method can be overridden to 'gmm' or 'kde' to prevent 'kde_throw_outliers', which automatically discards reads with copy numbers that lie outside of the IQR of the data.

Hope this helps and please let me know if you have any other questions, Kinsey