SJTU-CGM / HUPAN

Human pan-genome analysis pipeline
http://cgm.sjtu.edu.cn/hupan/

Can hupan alignContig process samples simultaneously or use multithreading? #8

Closed liufy11 closed 4 years ago

liufy11 commented 4 years ago

It seems that hupan alignContig aligns samples one by one, and there is no multithreading option in the usage of alignContig. As a result, hupan alignContig runs too slowly. Can you tell me how to speed it up?

zhqduan commented 4 years ago

Thank you for using HUPAN! Yes, 'hupan alignContig' aligns samples one by one. Alternatively, if you are working on a machine with an LSF or SLURM scheduler, we suggest using 'hupanSLURM align' or 'hupanLSF align' to process multiple samples in parallel. The alignment tool in this step is MUMmer, and it needs a large amount of memory to align assembled human contigs to the human reference genome. The tool will be very slow if the machine does not have enough memory.

liufy11 commented 4 years ago

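Hello Dr. Duan,

First, thank you for your reply. While using HUPAN I ran into a new problem. I am using HUPAN to build a rapeseed pan-genome. When I used hupan extractSeq to extract non-reference sequences, I found that 99% of the sequences were extracted. However, my contigs were not assembled from second-generation reads with SGA. I had already assembled the genomes to chromosome level from third-generation plus second-generation reads, then aligned them to the reference genome and extracted the non-reference sequences. Because the query sequences are very long, MUMmer splits each of them into many alignment fragments. As a result, the COV R values in the coords file are very low, and a single fragment's alignment rarely exceeds 95%. extractSeq classifies every sequence whose COV R does not exceed 95% as non-reference. I checked the results in the example: because the SGA-assembled sequences are short, most of their MUMmer COV R values reach 95%. I have the following questions:

(1) When extractSeq extracts sequences, if the cumulative coverage of the query (with overlaps removed) reaches 95%, is the query still treated as a non-reference sequence?
(2) Does extractSeq evaluate each contig as a whole? Can it not remove the parts that are highly homologous to the reference and keep only the non-reference parts?
(3) The SGA-assembled sequences in the Example are very short, generally a few hundred bp. Is gene annotation meaningful for such sequences? Many genes are longer than 1000 bp.
(4) Can HUPAN build a pan-genome from multiple already-assembled genomes? If not, do you know of another solution?

Thank you!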
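To make the "cumulative coverage with overlaps removed" in question (1) concrete, here is a minimal Python sketch. It is not HUPAN's extractSeq logic. It assumes the query start/end positions of all alignments for one contig have already been read from a MUMmer coords file, and the function name `merged_query_coverage` is hypothetical.

```python
def merged_query_coverage(intervals, query_length):
    """Fraction of a query covered by the union of its alignment intervals.

    intervals: (start, end) pairs on the query, 1-based and inclusive.
    Reverse-strand hits may be given with start > end; they are normalised.
    This is an illustration only, not how HUPAN computes coverage.
    """
    spans = sorted((min(s, e), max(s, e)) for s, e in intervals)
    covered = 0
    cur_start = cur_end = None
    for s, e in spans:
        if cur_end is None or s > cur_end:
            # No overlap with the current merged span: close it, start a new one.
            if cur_end is not None:
                covered += cur_end - cur_start + 1
            cur_start, cur_end = s, e
        else:
            # Overlapping span: extend the current merged span.
            cur_end = max(cur_end, e)
    if cur_end is not None:
        covered += cur_end - cur_start + 1
    return covered / query_length


# Three fragments of a 1000 bp query; the last one is a reverse-strand hit.
fragments = [(1, 500), (400, 900), (1000, 950)]
print(merged_query_coverage(fragments, 1000))  # 0.951
```

In this made-up example no single fragment covers more than 50.1% of the query, yet the merged fragments cover 95.1%, which is the situation described above for long, chromosome-level queries.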

zhqduan commented 4 years ago

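Hello! HUPAN is designed for assemblies of second-generation sequencing data. If you have chromosome-level assemblies, we do not recommend using HUPAN to extract non-reference sequences. You can refer to papers on pan-genomes built from third-generation sequencing, such as https://www.nature.com/articles/s41597-020-0438-2 . Regarding your questions:

(1) The extractSeq step exists because QUAST is slow: we first remove sequences that are highly similar to the reference in order to reduce the computation. For assemblies of third-generation sequencing data, we suggest using QUAST-LG to extract the non-reference sequences directly, but we have not evaluated the accuracy of the results.
(2) See (1).
(3) The Example is designed only to test the pipeline and has no practical meaning. The N50 of SGA assemblies of high-depth second-generation human genome data can exceed 7 kb. In addition, novel genes are generally short.
(4) HUPAN can build a pan-genome from multiple assembled genomes. In our paper, for example, we built the Han Chinese pan-genome from the assemblies of 90 Han Chinese individuals together with newly assembled genomes.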
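For readers unfamiliar with the N50 statistic mentioned in (3), the following Python sketch shows the usual definition: the contig length at which the contigs of that length or longer account for at least half of the total assembly length. The function name `n50` and the example lengths are illustrative and do not come from HUPAN or SGA output.

```python
def n50(lengths):
    """Return the N50 of a collection of contig lengths."""
    total = sum(lengths)
    running = 0
    # Walk from the longest contig down until half of the total is reached.
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    raise ValueError("no contig lengths given")


print(n50([2, 3, 4, 5, 6, 7, 8, 9, 10]))  # 8
```

An assembly with an N50 above 7 kb therefore has at least half of its bases in contigs longer than 7 kb, which is very different from the few-hundred-bp contigs in the test example.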