Base quality filtering - GitHub Issues

Ben7124 commented 3 years ago

Hi, thanks for responding so promptly to these issues! I have the pipeline working, but I was wondering does XICRA have any quality filtering parameters that we can specify to remove low quality bases? Thanks!

JFsanchezherrero commented 3 years ago

Hi there,

You are welcome, thanks for your appreciation.

Unfortunately, XICRA does not include a specific option for this, but you should be able to do some filtering by passing the appropriate cutadapt options to the XICRA trimm module. Basically, include the option --extra in your XICRA trimm module call with the appropriate options and parameters. Example command: XICRA trimm -i XICRA_analysis -a GTAGCTAGCTAGCT -A GTACGATCGAGCATGCATC --extra '-u 4 -u -4 --minimum-length 23'

Read about the cutadapt options and parameters on the following website: https://cutadapt.readthedocs.io/en/stable/
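
For quality trimming specifically, cutadapt provides the -q option, which trims low-quality bases from read ends. The following is a minimal sketch, assuming XICRA forwards the --extra string to cutadapt unchanged, as the reply above suggests. The cutoff of 20 is only an illustrative value, and the run helper is hypothetical: it is there so the command can be inspected before it is executed.

```shell
#!/bin/sh
# Hypothetical helper: prints the command when DRY_RUN=1 (the default),
# runs it otherwise. The printed form does not show shell quoting.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# cutadapt option -q 20: trim low-quality bases (cutoff 20) from the 3' end
# of each read before adapter removal. The value 20 is illustrative only.
EXTRA_OPTS='-q 20'

# Adapter sequences are the ones from the example command in this thread.
run XICRA trimm -i XICRA_analysis \
    -a GTAGCTAGCTAGCT -A GTACGATCGAGCATGCATC \
    --extra "$EXTRA_OPTS"
```

Setting DRY_RUN=0 in the environment makes the helper execute the command instead of printing it.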

If you are interested in using any other software, you can do your trimming or QC filtering outside XICRA and return to the pipeline for the following steps (join, biotype, miRNA, tRNA, ...). You should move or copy your trimmed reads (named sampleX_trim_R1.fastq and sampleX_trim_R2.fastq) into the trimm folder of the corresponding sample (e.g. XICRA_analysis/data/sampleX/trimm).
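
The copy step described above could be sketched as a small shell function. This is only an illustration of the folder and file naming given in this reply; the function name stage_trimmed_reads and the input file names are hypothetical.

```shell
#!/bin/sh
# Copy externally trimmed paired reads into the layout described above:
#   <project_dir>/data/<sample>/trimm/<sample>_trim_R1.fastq (and _R2)
# Arguments: project dir, sample name, trimmed R1 file, trimmed R2 file.
stage_trimmed_reads() {
    project_dir=$1
    sample=$2
    r1=$3
    r2=$4
    dest="$project_dir/data/$sample/trimm"

    mkdir -p "$dest" || return 1
    cp "$r1" "$dest/${sample}_trim_R1.fastq" || return 1
    cp "$r2" "$dest/${sample}_trim_R2.fastq" || return 1
}

# Example call (input file names are placeholders):
# stage_trimmed_reads XICRA_analysis sampleX my_trimmed_R1.fastq my_trimmed_R2.fastq
```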

Best regards, Jose

Ben7124 commented 3 years ago

Thank you Jose! I was also wondering what trimming parameters you recommend for small RNA data? For example, do you recommend minimum length be 23? Or some other number like 15? In your work do you also use a maximum length filter?

JFsanchezherrero commented 3 years ago

Hi there, The main mandatory parameters to include in the XICRA trimm module are adapters. In your case, you would need to provide 3' and 5' adapters. Take into account your library prep details for including the appropriate adapters.

As a default value for the option --min_read_len within this module, we set 15 bp. miRNAs tend to be around 20-22 bp, but multiple isoforms are possible and reads might vary in length. Also, I would not set a maximum length, since you might also have sequenced additional small RNAs such as tRNA, piRNA, etc., and you might be interested in analyzing them.
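
Since read lengths vary, it can help to look at the length distribution of the trimmed reads before settling on a minimum length. The following is a minimal sketch using awk; it is not part of XICRA, the function name read_length_histogram is hypothetical, and it assumes an uncompressed FASTQ file with four lines per record.

```shell
#!/bin/sh
# Print "length<TAB>count" for every read length found in a FASTQ file,
# sorted by length. Assumes the standard 4-line record layout, where the
# sequence is the second line of each record.
read_length_histogram() {
    awk 'NR % 4 == 2 { n[length($0)]++ }
         END { for (len in n) printf "%d\t%d\n", len, n[len] }' "$1" | sort -n
}

# Example call (path follows the folder layout mentioned in this thread):
# read_length_histogram XICRA_analysis/data/sampleX/trimm/sampleX_trim_R1.fastq
```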

Although the XICRA biotype module works by mapping reads to the reference genome, the analysis of each specific biotype (miRNA, tRNA, etc.) works by using databases with specific annotation (only miRNA, only tRNA, ...). So even if your data are 40% miRNA, 30% tRNA and 30% other biotypes, you would only identify the miRNAs when using the miRNA module. In other words, you don't need to worry about additional non-miRNA sequences in your data.

Check additional information for the trim module in the documentation page: https://xicra.readthedocs.io/en/latest/user_guide/modules/trimm.html

Regards,

Ben7124 commented 3 years ago

Thank you again Jose for your help!

HCGB-IGTP / XICRA

Base quality filtering #16