Cluster Ratio - GitHub issues

Hi Dan,

The setting is accessible in the development branch of the pipeline https://github.com/SorenKarst/longread_umi/tree/develop

    -U  UMI filter settings. Define settings for:
        - UMI match error mean (UMEM): Mean match error between reads in a bin
          and the UMI reference.
        - UMI match error SD (UMESD): Standard deviation for match error between
          reads in a bin and the UMI reference.
        - Bin cluster ratio (BCR): Ratio between UMI bin size and UMI cluster size.
        - Read orientation ratio (ROR): n(+ strand reads)/n(all reads). '0'
          means disabled.
        Settings can be provided as a string: 'UMEM/UMESD/BCR/ROR'
        Or as a preset:
        - 'r941_min_high_g360' == '3;2;6;0.3'
        - 'r103_min_high_g360' == '3;2.5;12;0.3'
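
To make the field order concrete, here is a minimal Python sketch that splits a settings string into its four named values. It assumes the semicolon-separated form shown in the presets; `UmiFilter` and `parse_umi_filter` are hypothetical names used for illustration only and are not part of the pipeline.

```python
from typing import NamedTuple


class UmiFilter(NamedTuple):
    """Hypothetical container for the four -U values, in help-text order."""
    umem: float   # UMI match error mean
    umesd: float  # UMI match error SD
    bcr: float    # bin cluster ratio
    ror: float    # read orientation ratio ('0' means disabled)


def parse_umi_filter(setting: str) -> UmiFilter:
    # Assumption: values are separated by ';' as in the presets above.
    parts = setting.split(";")
    if len(parts) != 4:
        raise ValueError(f"expected 4 values, got {len(parts)}: {setting!r}")
    return UmiFilter(*(float(p) for p in parts))


preset = parse_umi_filter("3;2.5;12;0.3")  # values of 'r103_min_high_g360'
print(preset.bcr)
```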

The process of defining the UMI bins starts with finding high-quality UMI sequences. In this step, the raw UMI sequences are heavily filtered and then clustered. The UMI cluster size is the number of high-quality UMI sequences assigned to a specific UMI during this step. Depending on the error rate, this number can be much lower than the actual UMI bin size. The high-quality UMIs are then used to recruit raw reads, and the number of raw reads recruited to a specific UMI is the UMI bin size.

As far as I remember, you should be able to pull the UMI cluster size and the UMI bin size from the umi_binning_stats.txt file. The UMI cluster size should be in the name of the UMI in column 1, and the UMI bin size should be in column 2.
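
A rough Python sketch of how the ratio could be computed from such a file is given below. Several things here are assumptions and should be checked against your actual output: that the columns are whitespace-separated, that the cluster size appears in the UMI name as `size=<n>`, and that BCR acts as an upper limit on bin size divided by cluster size. The function name and the two example lines are made up for illustration.

```python
import re
from typing import Iterable, Iterator, Tuple

# Assumption: the cluster size is embedded in the UMI name as 'size=<n>'.
SIZE_RE = re.compile(r"size=(\d+)")


def bin_cluster_ratios(lines: Iterable[str]) -> Iterator[Tuple[str, float]]:
    """Yield (UMI name, bin size / cluster size) for each parsable line."""
    for line in lines:
        fields = line.split()
        if len(fields) < 2:
            continue
        match = SIZE_RE.search(fields[0])
        # Skip header lines or rows that do not follow the assumed layout.
        if match is None or not fields[1].isdigit():
            continue
        cluster_size = int(match.group(1))
        bin_size = int(fields[1])
        if cluster_size == 0:
            continue
        yield fields[0], bin_size / cluster_size


# Hypothetical lines, not real pipeline output.
example = ["umi1;size=4; 20", "umi2;size=2; 30"]
for name, ratio in bin_cluster_ratios(example):
    status = "above" if ratio > 6 else "within"
    print(f"{name}\t{ratio:.1f}\t{status} BCR=6")
```

To run it on a real file, pass an open file handle instead of the example list, for instance `bin_cluster_ratios(open("umi_binning_stats.txt"))`, and adjust the regular expression if the UMI names are formatted differently.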

Best regards, Søren

SorenKarst / longread_umi

Cluster Ratio #42