SOAPnuke.2.2.6: usage of the -L parameter

anshrly commented 8 months ago

Hi, I recently ran into a problem while filtering with SOAPnuke.2.2.6 and would appreciate some help. I want to end up with at least 22M clean reads, so I set the -L parameter. The command line was:

SOAPnuke.2.2.6 filter -R 41011723 -L 22000000 -f AAGTCGGAGGCCAAGCGGTCTTAGGAAGACAA -r AAGTCGGATCGTAGCCATGTCGTTCTGTGAGCCAAGGAGTTG -1 raw.1.fq.gz -2 raw.2.fq.gz -C clean_1.fq.gz -D clean_2.fq.gz -o result

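However, the filtered output contains fewer than 22M clean reads, while the raw read count is exactly 22M. So does -L set the number of raw reads rather than the number of clean reads obtained?

cat Basic_Statistics_of_Sequencing_Quality.txt
item	raw reads(fq1)	clean reads(fq1)	raw reads(fq2)	clean reads(fq2)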
Read length	150.0	150.0	150.0	150.0
Total number of reads	22000000 (100.00%)	19892573 (100.00%)	22000000 (100.00%)	19892573 (100.00%)
Number of filtered reads	2107427 (9.58%)	-	2107427 (9.58%)	-
Total number of bases	3300000000 (100.00%)	2983885950 (100.00%)	3300000000 (100.00%)	2983885950 (100.00%)
Number of filtered bases	316114050 (9.58%)	-	316114050 (9.58%)	-
Number of base A	768797016 (23.30%)	691175550 (23.16%)	765981849 (23.21%)	692845445 (23.22%)
Number of base C	858757499 (26.02%)	781336195 (26.19%)	937064495 (28.40%)	857786803 (28.75%)
Number of base G	905225714 (27.43%)	815179421 (27.32%)	835232863 (25.31%)	748190945 (25.07%)
Number of base T	765924074 (23.21%)	695247404 (23.30%)	761060830 (23.06%)	684637740 (22.94%)
Number of base N	1295697 (0.04%)	947380 (0.03%)	659963 (0.02%)	425017 (0.01%)
Q20 number	3246623126 (98.38%)	2932381006 (98.27%)	3174678684 (96.20%)	2869463274 (96.17%)
Q30 number	3132356444 (94.92%)	2823194601 (94.61%)	2970663655 (90.02%)	2682127128 (89.89%)

Looking forward to your reply. Best wishes.

berry08 commented 8 months ago

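Hi, I assume you mean version 2.1.6? -L is a parameter from older versions. To keep a specified number of reads in the output files, add the line "totalReadsNum=22000000" to the parameter file passed with -c (for example, a file named config), and then run: SOAPnuke filter -c config <other parameters>. So in your example, -L had no effect because it is deprecated. All of this of course assumes that the number of reads left after filtering is larger than the number of reads to extract.

A further note on totalReadsNum: it extracts the specified number of reads from the filtered reads and writes them to the output files. There are two extraction modes:
1. Random extraction (the default).
2. Extraction from the head of the data (totalReadsNum=22000000head). This is faster, because the program stops as soon as it has collected enough data.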
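As a rough illustration of the setup described in this reply, the sketch below writes the parameter file and assembles the command from the original post with -L removed and -c added. The file name config is the example name used in the reply. Keeping every other option unchanged is an assumption, since the thread does not say whether they are still needed. The command is only printed, not executed.

```shell
#!/bin/sh
# Parameter file as described in the reply; "config" is only an example name.
# For head-mode extraction the reply gives the value as 22000000head instead.
printf 'totalReadsNum=22000000\n' > config

# Command line from the original post with -L dropped and "-c config" added.
# All other options are copied unchanged from that post (an assumption).
cmd="SOAPnuke filter -c config -R 41011723 \
-f AAGTCGGAGGCCAAGCGGTCTTAGGAAGACAA \
-r AAGTCGGATCGTAGCCATGTCGTTCTGTGAGCCAAGGAGTTG \
-1 raw.1.fq.gz -2 raw.2.fq.gz -C clean_1.fq.gz -D clean_2.fq.gz -o result"

# Print the command for inspection instead of running it.
echo "$cmd"
```

To actually run the filter, the printed line can be pasted into a shell where SOAPnuke is on the PATH.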

anshrly commented 8 months ago

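Thank you very much for your reply! I checked the version and it really is 2.2.6. My raw input has 41M reads, and I want 22M reads after filtering. I then tried changing -L 22000000 to -L 22000000head, and the filtered output did reach 22M. So I am a bit confused: without head, -L seems to limit the raw reads that are read in, whereas with head it does guarantee the number of clean reads I expect. Shouldn't both forms guarantee the requested amount, with the only difference being that head takes the first 22M reads while the default draws 22M reads at random?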
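The precondition mentioned earlier (more reads must survive filtering than are requested) can be checked with some quick arithmetic. The sketch below assumes that the pass rate seen in the posted statistics (19892573 clean out of 22000000 raw) also holds for the whole input, and that the -R value from the original command, 41011723, is the total raw read count. Both are assumptions, not statements from the thread.

```shell
#!/bin/sh
raw_seen=22000000     # raw reads in the posted statistics
clean_seen=19892573   # clean reads in the posted statistics
raw_total=41011723    # assumed total raw reads (the -R value in the post)
target=22000000       # clean reads wanted

# Scale the observed pass rate up to the full input.
est_clean=$(awk -v r="$raw_seen" -v c="$clean_seen" -v t="$raw_total" \
  'BEGIN { printf "%d", t * c / r }')

echo "estimated clean reads from the full input: $est_clean"
if [ "$est_clean" -ge "$target" ]; then
  echo "enough reads should survive filtering to extract $target"
else
  echo "filtering is likely to leave fewer than $target reads"
fi
```

Under these assumptions the estimate comes out at roughly 37 million, comfortably above the 22M target.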

berry08 commented 6 months ago

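The -L parameter still appears in the usage text, which caused this misunderstanding; this will be fixed in the next update. The correct usage is the one described above: add the line "totalReadsNum=22000000" to the parameter file passed with -c (for example, a file named config). Please try it that way.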

BGI-flexlab / SOAPnuke

SOAPnuke.2.2.6: usage of the -L parameter #65