SorenKarst / longread_umi

GNU General Public License v3.0

Cluster Ratio #42

Closed by manterd 2 years ago

manterd commented 3 years ago

Can you provide a more detailed explanation of the cluster ratio setting? Your publication states that it is 'the ratio between the number of UMI binned reads to the size of the UMI reference cluster', but what does this mean in practice? Also, this value is not currently available as an input option in the nanopore_pipeline script; it is hardcoded instead. Interestingly, at the default setting of 10, more than 90% of the UMIs present after chimera removal are removed from the dataset, whereas changing the value to 20 results in 50% of the UMIs being retained. Thanks, Dan

SorenKarst commented 2 years ago

Hi Dan,

The setting is accessible through the `-U` option in the development branch of the pipeline: https://github.com/SorenKarst/longread_umi/tree/develop

    -U  UMI filter settings. Define settings for:
        - UMI match error mean (UMEM): Mean match error between reads in a bin
          and the UMI reference.
        - UMI match error SD (UMESD): Standard deviation for match error between
          reads in a bin and the UMI reference.
        - Bin cluster ratio (BCR): Ratio between UMI bin size and UMI cluster size.
        - Read orientation ratio (ROR): n(+ strand reads)/n(all reads). '0'
          means disabled.
        Settings can be provided as a string: 'UMEM/UMESD/BCR/ROR'
        Or as a preset:
        - 'r941_min_high_g360' == '3;2;6;0.3'
        - 'r103_min_high_g360' == '3;2.5;12;0.3'
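
To make the four fields concrete, here is a minimal Python sketch that resolves a preset name or a raw settings string into named values. It is not part of longread_umi: `UmiFilterSettings` and `parse_umi_filter` are hypothetical names, and the sketch assumes the semicolon-separated form shown in the presets (the help text writes the template with `/`, so check which separator your version expects).

```python
from typing import NamedTuple


class UmiFilterSettings(NamedTuple):
    umem: float   # UMI match error mean
    umesd: float  # UMI match error SD
    bcr: float    # bin cluster ratio
    ror: float    # read orientation ratio; 0 means disabled


# Preset values copied from the help text above.
PRESETS = {
    "r941_min_high_g360": "3;2;6;0.3",
    "r103_min_high_g360": "3;2.5;12;0.3",
}


def parse_umi_filter(value: str) -> UmiFilterSettings:
    """Resolve a preset name or a 'UMEM;UMESD;BCR;ROR' string.

    Hypothetical helper; the ';' separator is an assumption taken
    from the preset strings, not from the pipeline source.
    """
    fields = PRESETS.get(value, value).split(";")
    if len(fields) != 4:
        raise ValueError(f"expected 4 fields, got {len(fields)}: {value!r}")
    return UmiFilterSettings(*(float(f) for f in fields))
```

For example, `parse_umi_filter("r103_min_high_g360").bcr` gives the bin cluster ratio of that preset, 12.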

The process of defining the UMI bins starts with finding high-quality UMI sequences: the raw UMI sequences are heavily filtered and then clustered. The UMI cluster size is the number of high-quality UMI sequences assigned to a specific UMI in this step. Depending on the error rate, this number can be much lower than the actual UMI bin size. The high-quality UMIs are then used to recruit raw reads, and the number of raw reads recruited to a specific UMI is the UMI bin size. The bin cluster ratio is the UMI bin size divided by the UMI cluster size.

As far as I remember, you should be able to pull the UMI cluster size and the UMI bin size from the umi_binning_stats.txt file. The UMI cluster size should be in the name of the UMI in column 1, and the UMI bin size should be in column 2.
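
A rough Python sketch of how the ratio could be checked per UMI from that file follows. It rests on several assumptions that are not confirmed here: that columns are whitespace-separated, that the cluster size appears in the UMI name as `size=<n>`, and that a bin is kept when its ratio is at or below the threshold (which would match the observation that raising the value from 10 to 20 retains more UMIs). `BinStats`, `read_bin_stats` and `passes_bcr` are hypothetical names, not pipeline functions.

```python
import re
from typing import Iterable, Iterator, NamedTuple

# Assumption: the cluster size is embedded in the UMI name as 'size=<n>'.
SIZE_RE = re.compile(r"size=(\d+)")


class BinStats(NamedTuple):
    name: str
    cluster_size: int  # high-quality UMI sequences in the cluster
    bin_size: int      # raw reads recruited to the UMI

    @property
    def ratio(self) -> float:
        """Bin cluster ratio: bin size divided by cluster size."""
        return self.bin_size / self.cluster_size


def read_bin_stats(lines: Iterable[str]) -> Iterator[BinStats]:
    """Yield stats for every line with a parsable name and bin size."""
    for line in lines:
        cols = line.split()
        if len(cols) < 2:
            continue
        match = SIZE_RE.search(cols[0])
        if match is None or not cols[1].isdigit():
            continue  # header line or unexpected layout
        cluster_size = int(match.group(1))
        if cluster_size == 0:
            continue  # avoid division by zero
        yield BinStats(cols[0], cluster_size, int(cols[1]))


def passes_bcr(stats: BinStats, max_ratio: float) -> bool:
    """Assumed filter direction: keep bins at or below the threshold."""
    return stats.ratio <= max_ratio
```

Passing an open file handle for umi_binning_stats.txt to `read_bin_stats` and counting how many entries satisfy `passes_bcr` at different thresholds should show how strongly the setting affects a given dataset.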

best regards Søren