PacificBiosciences / FALCON_unzip

Making diploid assembly a common practice for genomic studies

unzip (installed on 20170823) failed at the task 0-phasing/012822F/blasr/task.json #93

Closed: nottwy closed this issue 6 years ago

nottwy commented 7 years ago

Under the directory 3-unzip/reads/ there is a file called '012822F_ref.fa', but the corresponding read file '012822F_reads.fa' is missing. This caused the task 0-phasing/012822F/blasr/task.json to fail, since it needs this read file.

I don't know how to generate this file manually. Can you tell me how?

gconcepcion commented 7 years ago

If one raw reads file is absent, it implies that the read-tracking step didn't finish properly, and there may be other raw read files missing as well.

You can try re-running the read-fetching step in track_reads.sh to ensure you've fetched all raw reads:

$ cat 3-unzip/reads/track_reads.sh
set -vex
trap 'touch /lustre/hpcprod/gconcepcion/170814Hemichordate/3-unzip/reads/track_reads_done.exit' EXIT
hostname
date
cd ../..
#python -m falcon_kit.mains.get_read_ctg_map
#python -m falcon_kit.mains.rr_ctg_track
#python -m falcon_kit.mains.pr_ctg_track
#mkdir -p 3-unzip/reads/
python -m falcon_kit.mains.fetch_reads
cd /lustre/hpcprod/gconcepcion/170814Hemichordate/3-unzip/reads
date
touch /lustre/hpcprod/gconcepcion/170814Hemichordate/3-unzip/reads/track_reads_done

Comment out the first three python steps, and try re-running the final python -m falcon_kit.mains.fetch_reads step by itself.
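
For example, something like this (a sketch only; the paths are from the run above, so substitute your own run directory):

$ cd /lustre/hpcprod/gconcepcion/170814Hemichordate/3-unzip/reads
$ bash track_reads.sh          # with the first three python steps commented out, as shown
# or run the fetch step directly from the root of the run (the script cd's there itself):
$ cd /lustre/hpcprod/gconcepcion/170814Hemichordate
$ python -m falcon_kit.mains.fetch_reads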

nottwy commented 7 years ago

Thanks for your reply! I have tried to read the code of fetch_reads.py and ran into a problem I can't solve. I have a contig named '012822F'. I parsed the files 'rawread_ids' and 'pread_ids' to find records matching the pattern /012822F . 0/, because the code of fetch_reads.py requires 'int(row[3]) == 0'. So I checked the condition for '012822F' manually, and found 5 records in 'pread_ids' that satisfy it.
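
For reference, a minimal sketch of the kind of manual check I did (this is not the FALCON-unzip code itself; the map file name and column positions are assumptions based on the description above, so adjust them to your files):

MAP_FILE = 'rawread_to_contigs'   # placeholder; use whichever map file fetch_reads.py reads
CTG_ID = '012822F'

hits = 0
with open(MAP_FILE) as f:
    for line in f:
        row = line.split()
        # count records that mention the contig and have row[3] == 0,
        # mirroring the 'int(row[3]) == 0' condition in fetch_reads.py
        if len(row) > 3 and CTG_ID in row and int(row[3]) == 0:
            hits += 1
print('%s: %d records with row[3] == 0' % (CTG_ID, hits))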

By the way, I have a little question here: what does the value of row[3] mean?

Because there is no 012822F_reads.fa, I think one of the conditions required for output must not be met. So I modified your code to print every value, to make sure the values for the contig '012822F' satisfy the output requirements, including:

  1. the ctg_id_hits value of 012822F is 5;
  2. 012822F is in all_ctg_ids.

Even so, I still haven't found the reason why there is no 012822F_reads.fa. Do you have any suggestion about where the problem might be? Thanks!

nottwy commented 7 years ago

I have found the reason. There is an error in your function 'pid_to_oid(pid)': it causes different preads to be linked to the same subread, and the later link overwrites the earlier one. As a result, the subreads of contig 012822F end up linked to other contigs and cannot be output. I'm sure it's a bug. What do you think of this problem?

The content of the function is:

def pid_to_oid(pid):
    fid = pid_to_fid[int(pid)]
    rid = int(fid.split('/')[1])/10
    return rid_to_oid[int(rid)]

The input of this function is a pid and the output is an oid; with this transform, different pids can map to the same oid. Some example data (oid followed by pid):

m54045_170723_035233/7995968/0_18238 000000080
m54045_170723_035233/7995968/0_18238 000000081
m54045_170723_035233/10158447/0_40972 000000296
m54045_170723_035233/10158447/0_40972 000000297
m54045_170723_035233/11862691/0_23644 000000487
m54045_170723_035233/11862691/0_23644 000000489

With lots of data this problem would go unnoticed, because some subreads would still be left over for each contig and unzip would not fail.
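
To illustrate the overwrite I mean, here is a minimal sketch (not the actual fetch_reads.py code; the dictionary and the second contig id are made up for illustration):

# suppose reads were assigned to contigs via a plain dict keyed by oid;
# then a later assignment silently replaces an earlier one
oid_to_ctg = {}

def assign(oid, ctg):
    oid_to_ctg[oid] = ctg

# two preads from different contigs that resolve to the same subread oid
assign('m54045_170723_035233/7995968/0_18238', '012822F')   # via pid 000000080
assign('m54045_170723_035233/7995968/0_18238', '000123F')   # via pid 000000081 (hypothetical other contig)

reads_for_ctg = [oid for oid, ctg in oid_to_ctg.items() if ctg == '012822F']
print(reads_for_ctg)   # [] -- nothing left to write into 012822F_reads.fa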

cschin commented 7 years ago

@nottwy Due to the error rate and the repeat content of the genome, it is possible for a subread to be used in correcting multiple preads. However, each pread (pid) can only be traced back to a single subread (or to one of the subreads from the same ZMW; presumably all subreads from the same ZMW come from the same molecule). In your case, it is not that there is no unique track from a pid to a subread; rather, there are better and worse error-corrected reads, and all the subreads are eventually recruited by the better ones.

nottwy commented 7 years ago

Let's start from a simpler point. Right now I just want to make unzip work properly, but because the file '012822F_reads.fa' can't be created for the reason I described above, unzip always reports an error and can't proceed. How can I solve this problem?

I have now worked out the complete chain of events. It goes like this:

  1. There are enough subreads for the contig 012822F, so 012822F is included in the list of contigs to be phased.
  2. Unzip tries to output the corresponding subreads of this contig, so it finds and stores the relationship between these subreads and the contig.
  3. Because these subreads are also linked to other contigs, the relationship between them and contig 012822F is lost.
  4. The file 012822F_reads.fa can't be generated.
  5. Unzip reports an error and stops.

Waiting for your reply! @cschin, also @gconcepcion

cschin commented 7 years ago

Again, please ask @PacBio how they want to "officially" handle this for you (if there is such a thing). On the other hand, by design the code looks at a list to see which contigs should be processed, so you can remove the problematic contig from that list. There is a way to hack through it since you have the source code. I simply have no resources to test any suggestion I make right now.

nottwy commented 7 years ago

OK. Let's wait for @gconcepcion 's reply.

gconcepcion commented 7 years ago

Hi nottwy,

It sounds like you have a good grasp of the problem; please feel free to submit a pull request and we can integrate a fix into the code. In the meantime, @cschin has provided a valid workaround: simply remove that contig from the 3-unzip/reads/ctg_list file and re-run the unzip pipeline. The offending contig will be ignored and you will be able to proceed with the pipeline.
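
For example (a sketch of that workaround, assuming ctg_list has one contig name per line; back up the file first):

$ cp 3-unzip/reads/ctg_list 3-unzip/reads/ctg_list.bak
$ grep -v '^012822F$' 3-unzip/reads/ctg_list.bak > 3-unzip/reads/ctg_list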

nottwy commented 7 years ago

@gconcepcion, thanks for @cschin's advice. I also found this workaround earlier but put it aside, because I don't think it really solves the problem. There should be another option, but it may take some time to find it; I'm trying but haven't found it yet, and I'm still thinking about this problem. If I make any progress or have any interesting ideas, I will send them to you and ask for your opinion if you leave me your contact details (or send them to me: nottwya@gmail.com), or I'll paste them here. You might not notice this issue in time, since I know you receive lots of issue notifications every day.

nottwy commented 7 years ago

I find that there may be an oid-rid matching error. @gconcepcion The file (file A) '2-asm-falcon/read_maps/get_ctg_read_map' holds the relationship unzip uses to match oids to rids, and the file (file B) '2-asm-falcon/read_maps/dump_rawread_ids/rawread_ids' contains the names of all subreads dumped from the database. The problem is, for example, that the oid of the subread on the 3rd row of file B should be 3. (We get this conclusion from the las file; the first and second columns of the las file are oids, correct? And they start from 1.) But in file A it is 2, and in fact all the oids of the subreads in file A are 1 smaller than the oids of the corresponding subreads.

nottwy commented 7 years ago

@cschin @gconcepcion, the observation I presented above is based on the unzip build from this link: https://downloads.pacbcloud.com/public/falcon/falcon-2017.06.28-18.01-py2.7-ucs4.tar.gz and I installed my falcon and falcon-unzip with this script: http://pb-falcon.readthedocs.io/en/latest/_downloads/install_unzip.sh I hope this helps you locate the bug in the code.

iggyB commented 6 years ago

Hi,

Any progress? Any ideas?

Bumped into a similar issue: 22 contigs have no corresponding reads.fa.

nottwy commented 6 years ago

Finally, I found the cause: LA4Falcon and LAshow work slightly differently. The read ids reported by LA4Falcon start from 0, while those reported by LAshow start from 1.
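
To make the difference concrete, a minimal sketch (assuming, as discussed above, that rawread_ids lists one subread name per line in database order):

with open('2-asm-falcon/read_maps/dump_rawread_ids/rawread_ids') as f:
    names = [line.strip() for line in f]

def oid_name_from_lashow_id(n):
    # LAshow read ids start at 1, so id n is the n-th line of the file
    return names[n - 1]

def oid_name_from_la4falcon_id(n):
    # LA4Falcon read ids start at 0, so id n is the (n+1)-th line of the file
    return names[n]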