HingeAssembler / HINGE

Software accompanying "HINGE: Long-Read Assembly Achieves Optimal Repeat Resolution"
http://genome.cshlp.org/content/27/5/747.full.pdf+html?sid=39918b0d-7a7d-4a12-b720-9238834902fd

Get draft assembly error #60

Closed agroppi closed 8 years ago

agroppi commented 8 years ago

Hi,

Still trying to work with HINGE on my data set. Here is a test on a single SMRT Cell. All previous steps went well, but now I'm stuck with:

Run postprocessing: Mon Aug 29 11:10:47 CEST 2016
get draft assembly : Mon Aug 29 11:23:49 CEST 2016
Traceback (most recent call last):
  File "/home/ag/rainman_home/HINGE/scripts/get_draft_path.py", line 24, in <module>
    stdout=subprocess.PIPE,bufsize=1)
  File "/module/apps/python/2.7.9/lib/python2.7/subprocess.py", line 710, in __init__
    errread, errwrite)
  File "/module/apps/python/2.7.9/lib/python2.7/subprocess.py", line 1335, in _execute_child
    raise child_exception
OSError: [Errno 2] No such file or directory
[2016-08-29 11:23:50.855] [log] [info] draft consensus
[2016-08-29 11:23:50.855] [log] [info] name of db: vriparia_test, name of .las file vriparia_test.las
[2016-08-29 11:23:50.855] [log] [info] name of fasta: , name of .paf file
[2016-08-29 11:23:50.855] [log] [info] filter files prefix: vriparia_test
[2016-08-29 11:23:50.855] [log] [info] output prefix: vriparia_test.draft
[2016-08-29 11:23:50.855] [log] [info] Parameters passed in

My script lines:

echo "#############################"
echo "Run postprocessing:" `date`
python2.7 /home/ag/rainman_home/HINGE/scripts/pruning_and_clipping.py vriparia_test.edges.hinges vriparia_test.hinge.list hinge_1

echo "#############################"
echo "get draft assembly :" `date`
python2.7 /home/ag/rainman_home/HINGE/scripts/get_draft_path.py $PBS_O_WORKDIR vriparia_test vriparia_testhinge_1.G2.graphml
/home/ag/rainman_home/HINGE/build/bin/consensus/draft_assembly --db vriparia_test --las vriparia_test.las --prefix vriparia_test --config /home/ag/rainman_home/HINGE/utils/nominal.ini --out vriparia_test.draft

This morning I updated get_draft_path.py according to https://github.com/fxia22/HINGE/commit/9ea413d3ddd71dec54d1fb5db5a1c7db606a7317

Thanks for your help
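For readers hitting the same traceback: when `subprocess.Popen` is given an argument list and the executable named first cannot be found on `PATH`, Python raises `OSError` with errno 2, which matches the message above. The snippet below is only a minimal sketch of that behaviour. The helper `try_launch` and the tool name `hypothetical-missing-tool-xyz` are made up for illustration, and a POSIX system with `echo` available is assumed.

```python
import errno
import subprocess

def try_launch(args):
    """Try to start args without a shell; return the errno on failure, else None."""
    try:
        proc = subprocess.Popen(args, stdout=subprocess.PIPE)
    except OSError as exc:
        # Raised before the child runs when the executable cannot be found.
        return exc.errno
    proc.communicate()
    return None

# Hypothetical command name that should not exist on any PATH.
missing = try_launch(["hypothetical-missing-tool-xyz", "arg"])
# echo is assumed to exist (POSIX); it stands in for a correctly installed tool.
found = try_launch(["echo", "ok"])

print(missing == errno.ENOENT, found is None)
```

If the first value is true for the real command, the executable is most likely not on `PATH` in the job environment, and giving its full path is one way around that.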

agroppi commented 8 years ago

Hi,

I have modified the script get_draft_path.py by adding "shell=True" :

stream = subprocess.Popen(DBshow_cmd.split(),
                                  stdout=subprocess.PIPE,bufsize=1,shell=True)

It seems to improve the process, but now I have another error :

Run postprocessing: Mon Aug 29 16:11:39 CEST 2016
#############################
get draft assembly : Mon Aug 29 16:24:02 CEST 2016
/scratch/ag/Vitireseq_Fasta/HINGE_Test/vriparia_test: DBshow: command not found
Traceback (most recent call last):
  File "/home/ag/rainman_home/HINGE/scripts/get_draft_path.py", line 101, in <module>
    vert_len = len(read_dict[int(vert_id)][1])
KeyError: 237590
[2016-08-29 16:24:04.237] [log] [info] draft consensus
[2016-08-29 16:24:04.237] [log] [info] name of db: vriparia_test, name of .las file vriparia_test.las
[2016-08-29 16:24:04.237] [log] [info] name of fasta: , name of .paf file
[2016-08-29 16:24:04.237] [log] [info] filter files prefix: vriparia_test
[2016-08-29 16:24:04.237] [log] [info] output prefix: vriparia_test.draft
[2016-08-29 16:24:04.237] [log] [info] Parameters passed in

agroppi commented 8 years ago

Hi again,

I put the whole path to DBshow :

DBshow_cmd = "/home/ag/rainman_home/HINGE/thirdparty/DAZZ_DB/DBshow "+ filedir+'/'+ filename+' '+dbshow_reads
stream = subprocess.Popen(DBshow_cmd.split(),
                                  stdout=subprocess.PIPE,bufsize=1,shell=True)

New error :

Run postprocessing: Mon Aug 29 16:37:34 CEST 2016
#############################
get draft assembly : Mon Aug 29 16:49:43 CEST 2016
Usage: DBshow [-unqUQ] [-w<int(80)>] [-m]+ path:db|dam [ reads:FILE | reads:range ... ]
Traceback (most recent call last):
  File "/home/ag/rainman_home/HINGE/scripts/get_draft_path.py", line 101, in <module>
    vert_len = len(read_dict[int(vert_id)][1])
KeyError: 237590
[2016-08-29 16:49:44.781] [log] [info] draft consensus
[2016-08-29 16:49:44.781] [log] [info] name of db: vriparia_test, name of .las file vriparia_test.las
[2016-08-29 16:49:44.781] [log] [info] name of fasta: , name of .paf file
[2016-08-29 16:49:44.781] [log] [info] filter files prefix: vriparia_test
[2016-08-29 16:49:44.781] [log] [info] output prefix: vriparia_test.draft
[2016-08-29 16:49:44.782] [log] [info] Parameters passed in

govinda-kamath commented 8 years ago

Hi Alexis,

Would it be possible for you to send us the vriparia_testhinge_1.G2.graphml that creates the error for us to debug that? The graphml file doesn't contain the sequence information, so you wouldn't be sharing any data (in case that's a concern for you).

Thanks, Govinda.

agroppi commented 8 years ago

@govinda-kamath

Here is the file: vriparia_testhinge_1.G2.graphml.zip

Thanks for your help,
Alexis

ilanshom commented 8 years ago

Hi Alexis,

It seems that the DBshow command is not running properly. If you run

/home/ag/rainman_home/HINGE/thirdparty/DAZZ_DB/DBshow $PBS_O_WORKDIR/vriparia_test 121

does it print the sequence for read 121?

agroppi commented 8 years ago

Hi,

This command prints:

m000_000/124/0_14063
acggttagctttaaaaaaaacataatctttgtggaaaaggtgttttaagctaaccatataaataataactgattttaaat ...etc ...

agroppi commented 8 years ago

Hi,

It seems that there is a shift between the read number passed to the DBshow command and the identifier in the FASTA output. I've tested:

From 1 to 12 ==> correct
From 13 to 76 ==> shift +1

Example:

ag@jarvis:~$ /home/ag/rainman_home/HINGE/thirdparty/DAZZ_DB/DBshow /scratch/ag/Vitireseq_Fasta/HINGE_Test/vriparia_test 13

m000_000/14/0_12805

From 77 to 81 ==> shift +2
From 81 to ==> shift +3

At 1000 the shift is 22, and so on.

Does this shifting cause the problem?

ilanshom commented 8 years ago

I don't think this shifting is causing the problem.

The fact that when you run get_draft_path you get Usage: DBshow [-unqUQ] [-w] [-m]+ path:db|dam [ reads:FILE | reads:range ... ] as part of your output suggests that the script is not calling DBshow in the correct manner. Can you modify the script to print the string DBshow_cmd (line 22) and show me what it prints? It's going to be a long string, as it includes thousands of read ids, but the beginning is what matters. Thanks.

agroppi commented 8 years ago

Hi,

Here is the beginning of the printed DBshow_cmd string:

/home/ag/rainman_home/HINGE/thirdparty/DAZZ_DB/DBshow /scratch/ag/Vitireseq_Fasta/HINGE_Test/vriparia_test 121 680 767 883 958 1035 1407 1568

I have also attached the whole output: DBshow_cmd_output.txt

Thanks again

ilanshom commented 8 years ago

Hi Alexis, if you just run

/home/ag/rainman_home/HINGE/thirdparty/DAZZ_DB/DBshow /scratch/ag/Vitireseq_Fasta/HINGE_Test/vriparia_test 121 680 767 883 958 1035 1407 1568

do the sequences appear properly? This command looks fine, but above you said that when you ran get_draft_path, it was printing Usage: DBshow [-unqUQ] [-w] [-m]+ path:db|dam [ reads:FILE | reads:range ... ]. This should only be printed if there was a problem with the command DBshow_cmd

agroppi commented 8 years ago

Yes, this command shows the sequences properly.

ilanshom commented 8 years ago

But when you run the script you are still getting Usage: DBshow [-unqUQ] [-w] [-m]+ path:db|dam [ reads:FILE | reads:range ... ]?

govinda-kamath commented 8 years ago

Ah! The problem seems to be in the modification of the command.

It should be

DBshow_cmd = "/home/ag/rainman_home/HINGE/thirdparty/DAZZ_DB/DBshow "+ filedir+'/'+ filename+' '+dbshow_reads
stream = subprocess.Popen(DBshow_cmd,
                                  stdout=subprocess.PIPE,bufsize=1,shell=True)

instead of

DBshow_cmd = "/home/ag/rainman_home/HINGE/thirdparty/DAZZ_DB/DBshow "+ filedir+'/'+ filename+' '+dbshow_reads
stream = subprocess.Popen(DBshow_cmd.split(),
                                  stdout=subprocess.PIPE,bufsize=1,shell=True)

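When using shell=True, the command should be passed as a single string and not split into a list; with a split list, only the first element is run as the command, so DBshow is called without its arguments and prints its usage message. This stackexchange answer explains it.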
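The difference can be reproduced without DBshow. The following is a minimal sketch, assuming a POSIX system where `shell=True` runs the command through `/bin/sh -c`; `echo` stands in for DBshow, and the helper `run` is made up for illustration.

```python
import subprocess

def run(cmd, shell):
    """Run cmd, wait for it to finish, and return its stdout as text."""
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE, shell=shell)
    out, _ = proc.communicate()
    return out.decode()

# echo stands in for DBshow; the numbers mimic read ids.
cmd = "echo 121 680 767"

# Split list + shell=True: the shell runs only "echo"; the remaining items
# become the shell's own positional parameters, so echo sees no arguments.
split_with_shell = run(cmd.split(), shell=True)

# Single string + shell=True: the shell parses the whole command line.
string_with_shell = run(cmd, shell=True)

# Split list + shell=False: the list is passed directly as the argument vector.
list_without_shell = run(cmd.split(), shell=False)

print(repr(split_with_shell))
print(repr(string_with_shell))
print(repr(list_without_shell))
```

Under these assumptions the first call prints an empty line, just as DBshow fell back to its usage message when it received no arguments, while the other two print the three numbers.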

agroppi commented 8 years ago

Unfortunately yes :(

Run postprocessing: Wed Aug 31 20:57:49 CEST 2016
#############################
get draft assembly : Wed Aug 31 21:11:57 CEST 2016
Usage: DBshow [-unqUQ] [-w<int(80)>] [-m]+ path:db|dam [ reads:FILE | reads:range ... ]
Traceback (most recent call last):
  File "/home/ag/rainman_home/HINGE/scripts/get_draft_path.py", line 102, in <module>
    vert_len = len(read_dict[int(vert_id)][1])
KeyError: 237590
[2016-08-31 21:11:58.785] [log] [info] draft consensus
[2016-08-31 21:11:58.785] [log] [info] name of db: vriparia_test, name of .las file vriparia_test.las
[2016-08-31 21:11:58.785] [log] [info] name of fasta: , name of .paf file
[2016-08-31 21:11:58.785] [log] [info] filter files prefix: vriparia_test
[2016-08-31 21:11:58.785] [log] [info] output prefix: vriparia_test.draft
[2016-08-31 21:11:58.786] [log] [info] Parameters passed in

Any idea?

agroppi commented 8 years ago

Sorry, I didn't see your last comment :/ I'll try it.

agroppi commented 8 years ago

Great news: it works! :)

#############################
Run postprocessing: Wed Aug 31 21:18:07 CEST 2016
#############################
get draft assembly : Wed Aug 31 21:31:43 CEST 2016
[2016-08-31 21:31:46.789] [log] [info] draft consensus
[2016-08-31 21:31:46.789] [log] [info] name of db: vriparia_test, name of .las file vriparia_test.las
[2016-08-31 21:31:46.789] [log] [info] name of fasta: , name of .paf file
[2016-08-31 21:31:46.789] [log] [info] filter files prefix: vriparia_test
[2016-08-31 21:31:46.789] [log] [info] output prefix: vriparia_test.draft
[2016-08-31 21:31:46.790] [log] [info] Parameters passed in

That said, I have read that using shell=True is not recommended (https://security.openstack.org/guidelines/dg_avoid-shell-true.html) ...
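One way to avoid shell=True altogether is to keep the argument list and give the full path to the executable, since the first error in this thread came from DBshow not being found on `PATH`. This is only a sketch of that idea, not the fix adopted in HINGE: the helpers `build_dbshow_args` and `run_without_shell` are hypothetical, the paths are the ones quoted in this thread, and `echo` is used as a stand-in so the snippet runs without DBshow installed (POSIX assumed).

```python
import os
import subprocess

def build_dbshow_args(dbshow_path, filedir, filename, read_ids):
    """Return an argv list: executable, database path, then one element per read id."""
    return [dbshow_path, os.path.join(filedir, filename)] + [str(r) for r in read_ids]

def run_without_shell(args):
    """Run args directly (no shell involved) and return stdout as text."""
    proc = subprocess.Popen(args, stdout=subprocess.PIPE)
    out, _ = proc.communicate()
    return out.decode()

# Paths as quoted in this thread; adjust them to your own installation.
args = build_dbshow_args(
    "/home/ag/rainman_home/HINGE/thirdparty/DAZZ_DB/DBshow",
    "/scratch/ag/Vitireseq_Fasta/HINGE_Test",
    "vriparia_test",
    [121, 680, 767],
)

# Stand-in run: replace the executable with echo to show what DBshow would receive.
demo = run_without_shell(["echo"] + args[1:])
print(demo)
```

Because no shell parses the command, nothing in the database path or the read ids is interpreted as shell syntax, which is the concern raised in the linked guideline.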

I will keep you informed about the rest of the pipeline

Thanks again