databio / peppro

A modular, containerized pipeline for PRO-seq data processing
http://peppro.databio.org/
BSD 2-Clause "Simplified" License
9 stars 2 forks source link

simplify _process_fastq #25

Closed nsheff closed 4 years ago

nsheff commented 5 years ago

the _process_fastq function is more than 1000 lines of code. This should be split out into separate functions or simplified to be understandable. There's no strict line limit for functions but as a rule if your function is getting over a hundred or two lines it's probably time to split it up...

nsheff commented 5 years ago

This logical flow shows up repeatedly:

if paired_end:
    # do stuff 1
else:
    if args.complexity and args.umi_len > 0:
        # do stuff 2
    else:
        # do stuff 3
nsheff commented 5 years ago

the process should be divided into functions to minimize the over-indentation problems that made the code too long and unreadable. some function ideas:

So it would look like:

...
if not paired_end and not args.umi_len <= 0:
    dedup_cmd = _build_dedup_cmd(...)
else:
    dedup_cmd = _build_dedup_cmd(...)

if args.umi:
    if adapter == "fastp": 
        _process_umi_fastp(...)
    else:
        _process_umi_seqtk(...)
else:
    if args.trimmer == "seqtk":
        _process_no_umi_seqtk(...)
    elif args.trimmer == "fastx":
        _process_no_umi_fastx(...)
    else:
        use seq tk anyway??   <--- here the code looks to duplicate again?
...
nsheff commented 5 years ago

got the function below 800 LOC... getting there...

jpsmith5 commented 4 years ago

Reworked all these functions into separate components.