XPRESSyourself / XPRESSpipe

An alignment and analysis pipeline for Ribosome Profiling and RNA-seq data
https://xpresspipe.readthedocs.io/en/latest/
GNU General Public License v3.0

Read Pre-Processing with UMI Issue #49

Closed arish-n-shah closed 3 years ago

arish-n-shah commented 3 years ago

I am having trouble using the read pre-processing commands for reads with UMIs. I used the McGlincy et al. (2017) protocol to generate pooled ribosome footprint libraries. The resulting reads should have the structure: {[footprint]-[5nt UMI]-[5nt sample barcode]-[adapter]}.
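To make the read layout above concrete, here is a minimal Python sketch that splits one read into its parts. The function name `split_read` and the example sequence are hypothetical, and the sketch assumes the adapter appears intact in the read; real trimming tools also tolerate mismatches and partial adapters.

```python
def split_read(read, adapter, umi_len=5, barcode_len=5):
    """Split a read laid out as footprint + UMI + sample barcode + adapter.

    Hypothetical helper for illustration only: it requires an exact,
    full-length adapter match and positive UMI/barcode lengths.
    """
    adapter_start = read.find(adapter)
    if adapter_start == -1:
        raise ValueError("adapter not found in read")

    # Everything before the adapter is footprint + UMI + barcode.
    insert = read[:adapter_start]
    tail_len = umi_len + barcode_len
    if len(insert) <= tail_len:
        raise ValueError("insert too short to hold footprint, UMI and barcode")

    footprint = insert[:-tail_len]
    umi = insert[-tail_len:-barcode_len]
    barcode = insert[-barcode_len:]
    return footprint, umi, barcode


# Made-up example read; only the adapter comes from the command in this issue.
adapter = "AGATCGGAAGAGCACACGTCTGAA"
read = "ATGGCCATTGTAATGGGCCGCTGAAAGG" + "ACGTA" + "TTGCA" + adapter

footprint, umi, barcode = split_read(read, adapter)
print(footprint, umi, barcode)
```

With this layout, the 10 nt immediately upstream of the adapter hold the UMI followed by the sample barcode.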

I am passing the adapter sequence to the -a option, read1 to the --umi_location option, and 10 to the --umi_length option: xpresspipe trim -i test/ -o processed_reads/ -a AGATCGGAAGAGCACACGTCTGAA --umi_location read1 --umi_length 10

I am getting an error which says:

Trimming reads...
Traceback (most recent call last):
  File ".../miniconda3/envs/xpresspipe/bin/xpresspipe", line 8, in <module>
    sys.exit(main())
  File ".../miniconda3/envs/xpresspipe/lib/python3.7/site-packages/xpresspipe/__main__.py", line 150, in main
    run_trim(args_dict)
  File ".../miniconda3/envs/xpresspipe/lib/python3.7/site-packages/xpresspipe/trim.py", line 221, in run_trim
    + ' -U --umi_loc ' + str(args_dict['umi_location'])
KeyError: 'umi'

Let me know if you have run into this error before. Taking a closer look at trim.py, in the section that deals with UMI-containing reads (around lines 210-230), there is a reference to an argument "umi", str(args_dict['umi']), which seems to be breaking the call to fastp later in the code.

# Get UMI info
    if str(args_dict['umi_location']).lower() == '3prime':
        args_dict['umi'] = ''
        args_dict['lite_umi'] = ''
        args_dict['lite_umi'] = ' -l ' + str(args_dict['umi_length']) \
            + ' -s ' + str(args_dict['spacer_length']) \
            + ' -m ' + str(args_dict['min_length'])

    elif str(args_dict['umi_location']).lower() != 'none':
        args_dict['lite_umi'] = ''
        **args_dict['umi'] = str(args_dict['umi']) \
            + ' -U --umi_loc ' + str(args_dict['umi_location'])**
        if str(args_dict['umi_length']).lower() != 'none':
            args_dict['umi'] = str(args_dict['umi']) \
                + ' --umi_len ' + str(args_dict['umi_length'])
        if int(args_dict['spacer_length']) != 0:
            args_dict['umi'] = str(args_dict['umi']) \
                + ' --umi_skip ' + str(args_dict['spacer_length'])
    else:
        args_dict['umi'] = ''
        args_dict['lite_umi'] = ''

I do not know for sure, but I have added asterisks to the line that I think is the problem. I believe the args_dict['umi'] value that should be passed to fastp is [-U --umi_loc read1 --umi_len 10]. However, on this line the value being assigned to args_dict['umi'] is instead built from str(args_dict['umi']) itself, i.e. ["umi" -U --umi_loc read1]. I do not see this "umi" option in xpresspipe trim -h either.
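One way the argument string could be built without reading a key that has not been set is sketched below. This is a hypothetical helper (`build_umi_args`), not the XPRESSpipe code or the maintainer's fix; the flags are copied from the snippet quoted above, and the only change is that the string starts out empty.

```python
def build_umi_args(args_dict):
    """Return (umi, lite_umi) argument strings for the trimming call.

    Hypothetical sketch, not the XPRESSpipe implementation: both strings
    start empty instead of being read from args_dict['umi'], the key that
    the traceback reports as missing.
    """
    location = str(args_dict['umi_location']).lower()
    umi = ''
    lite_umi = ''

    if location == '3prime':
        # Options quoted from the '3prime' branch of the original snippet.
        lite_umi = (' -l ' + str(args_dict['umi_length'])
                    + ' -s ' + str(args_dict['spacer_length'])
                    + ' -m ' + str(args_dict['min_length']))
    elif location != 'none':
        umi = ' -U --umi_loc ' + str(args_dict['umi_location'])
        if str(args_dict['umi_length']).lower() != 'none':
            umi += ' --umi_len ' + str(args_dict['umi_length'])
        if int(args_dict['spacer_length']) != 0:
            umi += ' --umi_skip ' + str(args_dict['spacer_length'])

    return umi, lite_umi


# Values taken from the debug summary in this report.
args = {'umi_location': 'read1', 'umi_length': 10,
        'spacer_length': 0, 'min_length': 17}
umi, lite_umi = build_umi_args(args)
print(repr(umi))  # ' -U --umi_loc read1 --umi_len 10'
```

Returning the two strings rather than mutating `args_dict` keeps the sketch easy to test in isolation; the real code stores them back in the dictionary.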

I've included the debug info:

You are using the current version of XPRESSpipe...
======================
User commands summary:
======================
XPRESSpipe version: 0.6.2
cmd: trim
input: .../arish/RibosomeProfiling/working/xpress/210323/test/
output: .../arish/RibosomeProfiling/working/xpress/210323/processed_reads/
suppress_version_check: False
adapters: ['AGATCGGAAGAGCACACGTCTGAA']
quality: 28
min_length: 17
max_length: 0
front_trim: 1
umi_location: read1
umi_length: 10
spacer_length: 0
max_processors: 32
path: /home/arish/packages/miniconda3/envs/xpresspipe/lib/python3.7/site-packages/xpresspipe/
log_loc: .../arish/RibosomeProfiling/working/xpress/210323/processed_reads/
experiment: trim_2021_3_23_19h_19m_0s
log:  >> .../arish/RibosomeProfiling/working/xpress/210323/processed_reads/trim_2021_3_23_19h_19m_0s.log 2>&1
log_file: .../arish/RibosomeProfiling/working/xpress/210323/processed_reads/trim_2021_3_23_19h_19m_0s.log
=====================
End commands summary
=====================

arish-n-shah commented 3 years ago

While this still seems to be a bug, I personally will not be using fastp for read processing. After looking through the fastp documentation further, it seems that 3' UMIs are not supported. I will use UMI-tools to process the reads from this dataset instead.

Thank you for all of this great work. It helps a lot with data analysis.

j-berg commented 3 years ago

Hi @arish-n-shah, thanks for bringing this to my attention. I don't know how I didn't catch this in testing, but that umi argument is specific to the alignment step, where it lets alignment post-processing know it needs to handle trimmed UMIs. You are correct that fastp is unable to handle 3' UMIs, but I modified some of the fastp source code (included in XPRESSpipe as fastp-lite) to handle 3' UMIs for ribosome profiling. For these experiments, --umi_location 3prime should be provided whenever adapter trimming is performed. This week I will work on fixing these issues and making the docs more thorough about how to do this, and I should be able to have an updated version of the software out by the end of the weekend!
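For reference, the invocation described above might be assembled as follows. This is only a sketch: the paths and adapter are copied from the original report, and the UMI length of 10 is carried over from that command as an assumption (whether the 5 nt sample barcode should be counted in this value is not stated in this thread).

```python
import shlex

# Argument list for the trimming step with a 3' UMI.
# Only flags that appear in this thread are used.
trim_cmd = [
    "xpresspipe", "trim",
    "-i", "test/",
    "-o", "processed_reads/",
    "-a", "AGATCGGAAGAGCACACGTCTGAA",
    "--umi_location", "3prime",
    "--umi_length", "10",  # assumption: value carried over from the report
]

# Print a shell-safe command line; nothing is executed here.
print(" ".join(shlex.quote(arg) for arg in trim_cmd))
```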

j-berg commented 3 years ago

Hi @arish-n-shah

I have released v0.6.3, which should fix the above-mentioned issue. When processing with the --umi_location 3prime flag, UMIs are appended to the read names and are compatible with downstream UMI-tools processing, which recognizes the UMI as anything after the _ in the read name. Let me know if it keeps acting up, but in my tests the unused variable is no longer causing issues.
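As a rough illustration of the read-name convention described above, the following Python sketch pulls the UMI out of a read name. The helper `umi_from_read_name` and the example read name are hypothetical, and the sketch assumes the UMI is the text after the last `_` in the name.

```python
def umi_from_read_name(read_name, separator="_"):
    """Return the UMI appended to a read name.

    Hypothetical helper: assumes the UMI is everything after the last
    occurrence of `separator` in the read name.
    """
    _, found, umi = read_name.rpartition(separator)
    if not found or not umi:
        raise ValueError("no UMI found in read name: " + repr(read_name))
    return umi


# Made-up read name with a 10 nt UMI appended after the underscore.
example_name = "READ0001_ACGTATTGCA"
print(umi_from_read_name(example_name))  # ACGTATTGCA
```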


Also, an important note for the future: you will need to reinstall XPRESSpipe, and instead of using pip install . to install it into your conda environment, you should run bash install.sh once your environment is activated. I noticed the installation was inconsistent in how it handled the fastp-lite module I mentioned, so this script should handle that better.

Let me know if you run into any other issues!