fiberseq / fibertools-rs

Tools for fiberseq data written in rust.
https://fiberseq.github.io/fibertools/fibertools.html
42 stars 5 forks source link

MSP and NUC calling at the end of fibers #61

Closed RisaWatanabe closed 2 days ago

RisaWatanabe commented 3 days ago

Hi, I noticed that the MSPs and NUCs are not called at the edge of fibers. Below is an example of such fiber.

Is there a specific rule in fibertools-rs to not call any MSP/NUC at the edge of fibers?

Thanks.

mrvollger commented 3 days ago

Yes, there is, and its purpose is to avoid calling nucleosomes or MSPs whose starts or ends likely extend off of the Fiber. This is an adjustable parameter (-d, --distance-from-end) in ft add-nucleosomes:

ft add-nucleosomes --help
Add nucleosomes to a bam file with m6a predictions

Usage: ft add-nucleosomes [OPTIONS] [BAM] [OUT]

Arguments:
  [BAM]  Input BAM file. If no path is provided stdin is used. For m6A prediction, this should be a HiFi bam file with kinetics data. For other commands, this should be a bam file with m6A calls [default: -]
  [OUT]  Output bam file with nucleosome calls [default: -]

Options:
  -n, --nucleosome-length <NUCLEOSOME_LENGTH>                    Minium nucleosome length [default: 75]
  -c, --combined-nucleosome-length <COMBINED_NUCLEOSOME_LENGTH>  Minium nucleosome length when combining over a single m6A [default: 100]
      --min-distance-added <MIN_DISTANCE_ADDED>                  Minium distance needed to add to an already existing nuc by crossing an m6a [default: 25]
  -d, --distance-from-end <DISTANCE_FROM_END>                    Minimum distance from the end of a fiber to call a nucleosome or MSP [default: 45]
  -h, --help                                                     Print help
  -V, --version                                                  Print version

BAM-Options:
  -F, --filter <BIT_FLAG>        BAM bit flags to filter on, equivalent to `-F` in samtools view [default: 0]
  -x, --ftx <FILTER_EXPRESSION>  Filtering expression to use for filtering records Example: filter to nucleosomes with lengths greater than 150 bp -x "len(nuc)>150" Example: filter to msps with lengths between 30
                                 and 49 bp -x "len(msp)=30:50" Example: combine 2+ filter expressions -x "len(nuc)<150,len(msp)=30:50" Filtering expressions support len() and qual() functions over msp, nuc, m6a,
                                 cpg
      --ml <MIN_ML_SCORE>        Minium score in the ML tag to use or include in the output [env: FT_MIN_ML_SCORE=] [default: 125]

Global-Options:
  -t, --threads <THREADS>  Threads [default: 8]

Debug-Options:
  -v, --verbose...  Logging level [-v: Info, -vv: Debug, -vvv: Trace]
      --quiet       Turn off all logging
RisaWatanabe commented 3 days ago

Thank you Mitchel for the prompt answer. I ran 'ft predict-m6a' with default parameters to call NUCs and MSPs. The default value for the --distance-from-end parameter is 45bp, but in my previous example, both the last NUC and MSP were located more than 1kb from the end of the fiber. Is this behavior expected? Could it be that a NUC, which was called after the last MSP, was extending off the end and was thus removed from the output?

mrvollger commented 2 days ago

Yes, this situation can happen if, after I have predicted the positions of MSPs or NUCs, if they either start or end in the last 45bp, they are removed. This is because, in this situation, we believe there is a relatively high probability we have not observed the actual start or end.