alekseyzimin / masurca

GNU General Public License v3.0

3.3.0b longest_path.sh stops prematurely #93

Open Coprosmanaut opened 5 years ago

Coprosmanaut commented 5 years ago

Hello,

I have been using v3.3.0b to assemble a plant genome with Slurm. Our cluster has strict limits on the number of jobs, so I had to modify create_mega_reads.sh and jf_aligner.sh to work with parallel. Everything seemed to work well, but at the longest_path.sh stage it gets stuck and stops when the mr.txt.*.tmp files reach about 7 GB. I haven't been able to diagnose the problem. Any help would be much appreciated!

1st attempt output:
./longest_path.sh: line 3: 85679 Killed /apps/genomics/masurca/3.3.0b/el7/AVX512/gnu-7.3/bin/longest_path -t 20 -u ../guillaumeKUnitigsAtLeast32bases_all.41.fasta -k 41 -d 0.029 -o mr.txt.1.tmp /dev/stdin
Invalid input file. File truncated
[Tue 22 Jan 21:16:12 GMT 2019] mega-reads pass 2 on the grid exited, please re-run assemble.sh
slurmstepd: error: task_p_post_term: rmdir(/dev/cpuset/slurm2778416/slurm2778416.4294967294_0) failed Device or resource busy

2nd attempt (after regenerating assemble.sh):
./longest_path.sh: line 3: 88981 Killed /apps/genomics/masurca/3.3.0b/el7/AVX512/gnu-7.3/bin/longest_path -t 20 -u ../guillaumeKUnitigsAtLeast32bases_all.41.fasta -k 41 -d 0.029 -o mr.txt.1.tmp /dev/stdin
[Sun 27 Jan 22:40:39 GMT 2019] mega-reads pass 2 on the grid exited, please re-run assemble.sh
slurmstepd: error: task_p_post_term: rmdir(/dev/cpuset/slurm2797987/slurm2797987.4294967294_0) failed Device or resource busy

alekseyzimin commented 5 years ago

One of the input coords.batch*.gz files is broken. Run gzip -t on each of them to find the broken one, delete it and its corresponding .success file, and re-run.
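A minimal sketch of that check, assuming it is run against the mr_pass2 directory that holds the coords.batch*.gz files. The function name check_gzip_batches is hypothetical and not part of MaSuRCA; it only wraps gzip -t and prints each file that fails:

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of MaSuRCA): print every coords.batch*.gz
# file in DIR (default: current directory) that fails gzip's integrity test.
check_gzip_batches() {
  local dir="${1:-.}" f
  for f in "$dir"/coords.batch*.gz; do
    [ -e "$f" ] || continue              # glob matched nothing
    if ! gzip -t "$f" 2>/dev/null; then  # non-zero exit: corrupt or truncated gzip
      echo "$f"
    fi
  done
}
```

For example, `check_gzip_batches mr_pass2` prints nothing when every file passes. A clean result only means the gzip container is intact; it says nothing about whether the data inside was complete when it was compressed.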

drsfarhat commented 4 years ago

Hi, I encountered the same issue and ran gzip -t on all my mr_pass2/coords.batch*.gz files, but found no broken file. Moreover, all the jobs finished successfully.

Here is the output:
[Wed Jan 8 08:58:10 EST 2020] Processing pe library reads
[Wed Jan 8 08:58:10 EST 2020] Average PE read length 151
[Wed Jan 8 08:58:10 EST 2020] Using kmer size of 99 for the graph
[Wed Jan 8 08:58:11 EST 2020] MIN_Q_CHAR: 33
[Wed Jan 8 08:58:11 EST 2020] Estimated genome size: 1949705988
[Wed Jan 8 08:58:11 EST 2020] Computing super reads from PE
[Wed Jan 8 08:58:11 EST 2020] Using CABOG from /gpfs/software/MaSuRCA-3.3.4/bin/../CA8/Linux-amd64/bin
[Wed Jan 8 08:58:11 EST 2020] Running mega-reads correction/assembly
[Wed Jan 8 08:58:11 EST 2020] Using mer size 15 for mapping, B=17, d=0.029
[Wed Jan 8 08:58:11 EST 2020] Estimated Genome Size 1949705988
[Wed Jan 8 08:58:11 EST 2020] Estimated Ploidy 1
[Wed Jan 8 08:58:11 EST 2020] Using 40 threads
[Wed Jan 8 08:58:11 EST 2020] Output prefix mr.41.15.17.0.029
[Wed Jan 8 08:58:11 EST 2020] Pacbio coverage >25x, using 25x of the longest reads
[Wed Jan 8 08:58:11 EST 2020] Mega-reads pass 2
[Wed Jan 8 08:58:11 EST 2020] Running on the grid in 512 batches
./longest_path.sh: line 3: 415591 Killed /gpfs/software/MaSuRCA-3.3.4/bin/longest_path -t 20 -u ../guillaumeKUnitigsAtLeast32bases_all.41.fasta -k 41 -d 0.029 -o mr.txt.3.tmp /dev/stdin
[Wed Jan 8 09:27:21 EST 2020] mega-reads pass 2 on the grid exited, please re-run assemble.sh
Invalid input file. File truncated
[Wed Jan 8 09:27:21 EST 2020] mega-reads exited before assembly

alekseyzimin commented 4 years ago

Hello,

One of the jf_aligner jobs failed, producing truncated output. The gzip step that followed did not abort, so it wrote a valid gzip file containing the truncated output; that is why gzip -t reports no error. To diagnose which mr_pass2/coords.batch*.gz file is broken, you can feed the files one at a time to longest_path, discarding the output. longest_path will run fast but will produce nonsense output. One of the files will fail.
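One way to script this, as a hedged sketch rather than MaSuRCA's own procedure: decompress each file and pipe it into longest_path with the output sent to /dev/null. The "Expected number of lines but got:" messages followed by binary data reported further down this thread look like compressed bytes reaching longest_path, which is why this sketch decompresses with zcat -f first. Assumptions, all hypothetical: the function name probe_batches; the LONGEST_PATH, UNITIGS, K and D variables (copy the -u/-k/-d values from your own longest_path.sh and run from inside mr_pass2, because the unitig path is relative); and that a rejected file makes longest_path exit non-zero or print "Invalid input file". If your longest_path.sh filters the coords before longest_path, reproduce that filtering here too.

```shell
#!/usr/bin/env bash
# Hypothetical helper: feed each coords.batch*.gz file, decompressed, to
# longest_path on its own, discard the output (-o /dev/null), and print the
# name of every file that longest_path rejects.
# ASSUMPTIONS: LONGEST_PATH is MaSuRCA's longest_path binary; UNITIGS, K and D
# match the -u/-k/-d values in your longest_path.sh; a bad file makes
# longest_path exit non-zero or print "Invalid input file".
probe_batches() {
  local dir="${1:-.}"
  local lp="${LONGEST_PATH:-longest_path}"
  local unitigs="${UNITIGS:-../guillaumeKUnitigsAtLeast32bases_all.41.fasta}"
  local k="${K:-41}" d="${D:-0.029}"
  local f msg status
  for f in "$dir"/coords.batch*.gz; do
    [ -e "$f" ] || continue                       # glob matched nothing
    # Capture longest_path's messages; $? is the status of the last command
    # in the pipeline, i.e. longest_path itself.
    msg=$(zcat -f "$f" | "$lp" -t 2 -u "$unitigs" -k "$k" -d "$d" \
            -o /dev/null /dev/stdin 2>&1)
    status=$?
    case "$msg" in
      *"Invalid input file"*) status=1 ;;        # treat the message as failure
    esac
    if [ "$status" -ne 0 ]; then
      echo "$f"
    fi
  done
}
```

Running `probe_batches .` inside mr_pass2 should print only the batches that longest_path refuses; if it prints every file, check the binary path and the assumed input format before trusting the result.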

--Aleksey


-- Dr. Alexey V. Zimin Associate Research Scientist Department of Biomedical Engineering, Johns Hopkins University, Baltimore, MD, USA (301)-437-6260 website http://ccb.jhu.edu/people/alekseyz/ blog http://masurca.blogspot.com

elbourret commented 4 years ago

I have a similar problem using MaSuRCA 3.3.5.

My assembly stopped with this error:

[Thu Jan  9 08:54:35 CET 2020] Running on the grid in 241 batches
./longest_path.sh: line 3: 631346 Killed /net/cephfs/data/elevei/progs/MaSuRCA-3.3.5/bin/longest_path -t 20 -u ../guillaumeKUnitigsAtLeast32bases_all.41.fasta -k 41 -d 0.02 -o mr.txt.3.tmp /dev/stdin
[Sat Jan 11 00:41:28 CET 2020] mega-reads pass 2 on the grid exited, please re-run assemble.sh
[Sat Jan 11 00:41:28 CET 2020] Assembly with flye failed, see files under flye/

I ran gzip -t on each coords.batch*.gz file, and they are all fine.

I also ran longest_path on each coords.batch*.gz file and got strange output. I'm not sure how to interpret it or which files might be broken. Are the files that produce "Invalid input file." broken? If so, it seems all files are broken. More likely, I did not use the correct command for longest_path.

Here is the output from the first 4 files I fed to longest_path (I added the file names before each output):

coords.batch100.gz
Invalid input file. Expected number of lines but got: [unprintable binary data]
coords.batch101.gz
Invalid input file. Expected number of lines but got: [unprintable binary data]
coords.batch102.gz
Invalid input file. Expected number of lines but got: [unprintable binary data]
coords.batch103.gz
Invalid input file. Expected number of lines but got: [unprintable binary data]

I used this command to get the results above: nohup bash -c 'for f in coords.batch*.gz; do echo $f; /net/cephfs/data/elevei/progs/MaSuRCA-3.3.5/bin/merge_coords $f | /net/cephfs/data/elevei/progs/MaSuRCA-3.3.5/bin/ufasta extract -v -n "0" | /net/cephfs/data/elevei/progs/MaSuRCA-3.3.5/bin/longest_path -t 2 -u ../guillaumeKUnitigsAtLeast32bases_all.41.fasta -k 41 -d 0.02 -o /dev/null /dev/stdin; done' > test.longest_path.log &

alekseyzimin commented 4 years ago

One or more of the coords.batch*.gz files are corrupted. To test, run inside mr_pass2:

for k in coords.batch*.gz; do echo -n "Testing $k "; zcat -f "$k" | grep -c '^>'; done

Each "Testing" line gives a number that should equal the number of PacBio/Nanopore reads you have. If the number is smaller, the file is corrupted. Delete the corresponding .success file, regenerate assemble.sh, and re-run.
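The same count can be turned into a pass/fail list. This is a sketch under stated assumptions: the function name flag_short_batches is hypothetical, and when no expected read count is given it falls back to the largest count seen across the batches, on the assumption that every intact batch reports the same number of '>' lines. Each file is decompressed only once.

```shell
#!/usr/bin/env bash
# Hypothetical helper: count '>' header lines in each coords.batch*.gz file
# and print the files whose count is below EXPECTED.
# Usage: flag_short_batches DIR [EXPECTED]
# If EXPECTED is omitted (or 0), the largest count seen is used instead
# (ASSUMPTION: every complete batch reports the same count).
flag_short_batches() {
  local dir="${1:-.}" expected="${2:-0}" f n
  for f in "$dir"/coords.batch*.gz; do
    [ -e "$f" ] || continue
    # grep -c prints 0 but exits 1 when nothing matches; "|| true" keeps that 0.
    n=$(zcat -f "$f" | grep -c '^>' || true)
    echo "Testing $f $n" >&2                 # same report as the one-liner above
    printf '%s\t%s\n' "$n" "$f"
  done | awk -F'\t' -v want="$expected" '
    BEGIN { max = 0 }
    { n[NR] = $1 + 0; name[NR] = $2; if (n[NR] > max) max = n[NR] }
    END {
      ref = (want + 0 > 0) ? want + 0 : max  # reference count to compare against
      for (i = 1; i <= NR; i++) if (n[i] < ref) print name[i]
    }'
}
```

Inside mr_pass2, `flag_short_batches .` compares the batches against each other, while passing your read count as the second argument checks them against that number instead.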


elbourret commented 4 years ago

The coords.batch*.gz files are all fine according to gzip -t, and they all contain the same number of '>' lines, matching the number of reads. However, the assembly still fails.

MaSuRCA creates three mr.txt.*.tmp files that grow to several GB, but after several hours it crashes. If I try to restart, the mr.txt.*.tmp files are re-created from scratch, so I lose all progress.

Note that I started the assembly with MaSuRCA 3.3.4 but used 3.3.5 for the mr_pass2 stage, because I had another problem and thought the newer version might fix it. In the end, the version change did not seem to make a difference. I eventually solved the other problem by splitting the mr_create jobs into smaller batches, which for some reason lowered the memory consumption (even though it is not supposed to). Once I did that, everything ran smoothly. I don't think any of this is related to the current problem, but I thought I should report it just in case.

Is there anything else I can try to solve my current problem? I hope I don't need to restart the whole mr_pass2 stage, as it took a long time...