PacificBiosciences / FALCON_unzip


Most files empty in 3-unzip directory #25

Closed · peterdfields closed this issue 8 years ago

peterdfields commented 8 years ago

I tried running fc_unzip.py, and while the processes complete without error, the majority of files in the resulting 3-unzip directory are blank. For example, while the directory contains the files

all_h_ctg_edges all_h_ctg.fa all_h_ctg_ids all_p_ctg_edges all_p_ctg.fa all_phased_reads

and sub-directories

1-hasm reads

the files are completely blank.

I'm running the most recent versions of FALCON-integrate, FALCON-unzip, and SMRT Analysis v3. My fc_unzip.cfg is as follows:

[General]
job_type = local

[Unzip]
input_fofn = input.fofn
input_bam_fofn = input_bam.fofn
smrt_bin = /home/peter/bioinformatics/smrtlink/install/smrtlink-fromsrc_3.0.5.175021,175083-175021-174993-174993/bundles/smrttools/smrtcmds/bin/
jobqueue =
sge_phasing =
sge_quiver =
sge_track_reads =
sge_blasr_aln =
sge_hasm =
unzip_concurrent_jobs = 20
quiver_concurrent_jobs = 20

Any ideas on how I might troubleshoot what's going wrong? I worry that it might be the use of the 'local' job_type. The OS is CentOS 7.

gconcepcion commented 8 years ago

Hi Peter,

This sounds like a read tracking issue.

The first thing I would do is check that the raw reads were properly tracked into the 3-unzip directory.

$ du -hc 3-unzip/reads/

This number should roughly match the number of base pairs in your 0-rawreads/raw_reads.db:

$ DBstats 0-rawreads/raw_reads.db | head

The reads should be partitioned based on the primary contigs generated from the initial FALCON assembly.

Verify that the proportion of unassigned reads in "3-unzip/reads/unassigned_reads.fa" is not too high relative to the other primary contig read groupings.
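One quick way to check this (a rough sketch; it assumes the per-contig read FASTAs sit directly under 3-unzip/reads/) is to count the reads in each file and see whether unassigned_reads.fa dwarfs the per-contig groupings:

# count FASTA records per file and list the largest groupings first
$ for f in 3-unzip/reads/*.fa; do echo "$(grep -c '>' "$f") $f"; done | sort -rn | head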

I doubt job_type=local is the culprit; what makes you think that?

gconcepcion commented 8 years ago

Ahh - I'm sorry, I just re-read your first post and realized you said the "3-unzip/reads" directory IS blank.

This is definitely a read tracking issue.

Have you investigated the track_reads.sh* logs?
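Something like this should surface any obvious failures (a sketch; the exact log names and locations in your run directory may differ):

# locate the read-tracking logs and scan them for tracebacks or errors
$ find . -name 'track_reads.sh*' -print
$ find . -name 'track_reads.sh.log' -exec grep -inE 'error|exception|traceback' {} +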

peterdfields commented 8 years ago

Sorry for the confusion, but the 3-unzip/reads directory isn't empty. Judging by file size, it does look like the vast majority of reads are ending up in the 'unassigned_reads.fa' file.

Looking at the track_reads.sh.log file doesn't suggest anything went wrong. I can send it to you, though, if that would be helpful.

As for my concern about the 'local' job_type, I was simply worried that, since this setup hasn't been tested as much, something might be going wrong there.

gconcepcion commented 8 years ago

If you wouldn't mind sending the track_reads.sh.log, I can take a look and see if I notice any obvious issues.

peterdfields commented 8 years ago

You can download the log file using the following link: https://dl.dropboxusercontent.com/u/3750627/track_reads.sh.log.gz

gconcepcion commented 8 years ago

There does appear to be an issue, and I'll have to take a deeper look.

Here is a line from your track_reads.sh.log:

[48791]starting run_tr_stage1('/home/peter/bioinformatics/FALCON-integrate_old/ecoli_test/0-rawreads/raw_reads.db', '/home/peter/bioinformatics/FALCON-integrate_old/ecoli_test/0-rawreads/m_00006/raw_reads.6.las',
2500, 40, {})

Here is a line from a recent successful run:

[61907]starting run_tr_stage1('/lustre/hpcprod/gconcepcion/example/0-rawreads/raw_reads.db', '/lustre/hpcprod/gconcepcion/example/0-rawreads/m_00002/raw_reads.2.las',
2500, 40, dict(14206 elem))

Notice the very last element (the dictionary) is empty in your line, but has 14206 elements in mine.
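You can check the same thing in your own log with something like this (a sketch; adjust the path to wherever your track_reads.sh.log lives):

# the last argument of run_tr_stage1 is the read-to-contig dict; it should not be '{}'
$ grep 'starting run_tr_stage1' track_reads.sh.log | head -3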

peterdfields commented 8 years ago

Okay. Are there any additional files or info I can provide to help with the troubleshooting?

gconcepcion commented 8 years ago

This indicates that there is a problem resolving the read id --> contig mapping.

Can you post a few lines from your 2-asm-falcon/read_maps/read_to_contig_map?
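For example (a sketch, assuming the file exists at that path in your run):

# an empty or missing map here would explain the empty '{}' above
$ wc -l 2-asm-falcon/read_maps/read_to_contig_map
$ head -5 2-asm-falcon/read_maps/read_to_contig_map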

peterdfields commented 8 years ago

So it would appear that that file is also blank!

I should state that in tuning the assembly I modified the .cfg file settings a few times to tinker with the overlap_filtering_setting parameters. I then re-ran fc_run.py (following deletion of the 2-asm-falcon directory), and sometimes there was an error displayed even though the run_falcon_asm.sh script continued to completion successfully.

Can you recommend a best-practice way of restarting the FALCON run at just the 2-asm-falcon step, so that I can then re-try fc_unzip.py (assuming this is why 2-asm-falcon/read_maps/read_to_contig_map is blank and causing the downstream issues)?

gconcepcion commented 8 years ago

Yes, you will need to re-run 2-asm-falcon. The easiest thing to do is just remove the 2-asm-falcon dir and re-run FALCON, which will pick up from that point:

$ rm -rf 2-asm-falcon
$ fc_run fc_run.cfg

Also, I noticed in the log file that you're unzipping Ecoli? I'm assuming this is just a test case to get FALCON_unzip up and running, but I'm wondering what you're trying to accomplish by "unzipping" a haploid organism?

peterdfields commented 8 years ago

I'm not running the ecoli example; that's just the directory where I first got everything working with FALCON-integrate, and I went with it. I will try removing the directory and re-running, but I'm guessing the error will arise once again, where pypeFLOW shows an error but the run_falcon_asm.sh script continues.

gconcepcion commented 8 years ago

Can you share the error next time you get it? That could be the problem we are looking for.

peterdfields commented 8 years ago

Re-running fc_run.py this time resulted in no errors but also no read_maps sub-directory. Do you have an idea about what might have gone wrong?

gconcepcion commented 8 years ago

Sorry for the confusion; I just realized read_maps is actually generated as the very first step of FALCON_unzip (and written to the 2-asm-falcon directory), not the last step of FALCON.

Can you try these two commands and show me a couple of lines from each (or upload both to Dropbox):

$ DBshow -n 0-rawreads/raw_reads.db | tr -d '>' | LD_LIBRARY_PATH= awk '{print $1}' > raw_read_ids

$ DBshow -n 1-preads_ovl/preads.db | tr -d '>' | LD_LIBRARY_PATH= awk '{print $1}' > pread_ids

peterdfields commented 8 years ago

head -10 of raw_read_ids

m150128_212520_42243_c100748042550000001823169907081536_s1_p0/24/0_6994
m150128_212520_42243_c100748042550000001823169907081536_s1_p0/31/0_2603
m150128_212520_42243_c100748042550000001823169907081536_s1_p0/31/2649_4302
m150128_212520_42243_c100748042550000001823169907081536_s1_p0/45/0_9698
m150128_212520_42243_c100748042550000001823169907081536_s1_p0/45/9746_18796
m150128_212520_42243_c100748042550000001823169907081536_s1_p0/45/18836_28165
m150128_212520_42243_c100748042550000001823169907081536_s1_p0/60/0_1894
m150128_212520_42243_c100748042550000001823169907081536_s1_p0/64/0_10202
m150128_212520_42243_c100748042550000001823169907081536_s1_p0/66/0_6927
m150128_212520_42243_c100748042550000001823169907081536_s1_p0/68/0_2944

head -10 of pread_ids

prolog/0/0_3977
prolog/1/0_2539
prolog/30/0_9953
prolog/40/0_3462
prolog/50/0_8121
prolog/70/0_9023
prolog/110/0_9003
prolog/120/0_8150
prolog/130/0_2867
prolog/140/0_6390

I can send more if that's useful.

gconcepcion commented 8 years ago

Yes, visually they "appear" standard. If you wouldn't mind uploading them to Dropbox, I can take a more thorough look at the read matching algorithm.

peterdfields commented 8 years ago

Sure.

https://dl.dropboxusercontent.com/u/3750627/raw_read_ids.gz https://dl.dropboxusercontent.com/u/3750627/pread_ids.gz

gconcepcion commented 8 years ago

Hi Peter,

Sorry for the delay. If you could also upload "2-asm-falcon/sg_edges_list", "2-asm-falcon/ctg_paths", and "2-asm-falcon/utg_data", I should hopefully be able to get to the bottom of this for you shortly.

peterdfields commented 8 years ago

You can use the following links:

https://dl.dropboxusercontent.com/u/3750627/ctg_paths.gz https://dl.dropboxusercontent.com/u/3750627/sg_edges_list.gz https://dl.dropboxusercontent.com/u/3750627/utg_data.gz

I should also say that I was able to get the fc_unzip.py step to run to completion, though it still seems that a very large portion of the reads are ending up in the unassigned_reads.fa file.

gconcepcion commented 8 years ago

Hmm, this is strange: when I run those data files through the read tracking code, I get a 'read_to_contig_map' that is 33,476 lines long, not empty as in your case.

What do you mean when you say you were able to run it to completion - it completed, and the output files are no longer empty, just small? Is the read_to_ctg_map still empty?

How do you define "large portion"? Are there some reads that are now ending up in the proper fastas?

peterdfields commented 8 years ago

After I got fc_run.py to finish to completion (rather than seeing an error in the pypeFLOW output even though run_falcon_asm.sh continued executing), the fc_unzip.py script did run to completion and created non-empty, seemingly normal outputs. That said, I don't know exactly what to expect. There are 4,195,964 reads in the input fasta, of which 2,448,521 are placed in the unassigned_reads.fa file. Does this seem excessive? Many of the reads are quite long. There do seem to be proper fasta files in the reads directory, though.
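As an aside, counts like these can be pulled directly from the FASTA headers and the raw read database; a rough sketch, assuming the standard output layout:

# reads placed in the unassigned bin
$ grep -c '>' 3-unzip/reads/unassigned_reads.fa
# total raw reads, for the denominator
$ DBstats 0-rawreads/raw_reads.db | head -3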

gconcepcion commented 8 years ago

I agree, it "sounds" high, but for comparison's sake, here are some stats from a fungal and a human genome I've recently unzipped:

unassigned reads / input subreads:

Fungal - 135,922 / 369,622 = roughly 36% unassigned

Human - 6,342,498 / 13,506,121 = roughly 47% unassigned

Your dataset - 2,448,521 / 4,195,964 = roughly 58% unassigned

Yes, it's a little high, but it's certainly not too far outside the realm of possibility.

What are the final unzipped contig stats looking like?

3-unzip/all_p_ctg.fa should be roughly the size of your un-phased assembly in 2-asm-falcon - if not, there is an issue.

And 3-unzip/all_h_ctg.fa: the size of your haplotig file depends on the heterozygosity of your organism. The more heterozygous your organism is, the larger this file should be, but it will still only be a subset of the primary contig fasta. If it's not very heterozygous (very little difference between haplotypes), then the haplotig file should be small. I know this is not very quantitative, but it's hard to say without knowing much about your data.
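A rough way to eyeball both checks (a sketch; 2-asm-falcon/p_ctg.fa as the un-phased primary contig file is the usual FALCON output name, but confirm it in your run):

# primary contigs before and after unzip should be comparable in size
$ ls -lh 2-asm-falcon/p_ctg.fa 3-unzip/all_p_ctg.fa
# total haplotig sequence vs. total primary contig sequence
$ grep -v '>' 3-unzip/all_h_ctg.fa | tr -d '\n' | wc -c
$ grep -v '>' 3-unzip/all_p_ctg.fa | tr -d '\n' | wc -c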

At any rate, it looks like the initial question that this issue was "opened" for has been resolved by re-running the last stage of FALCON, so I'm going to close it.

Feel free to open a new issue regarding your final output statistics if you're still concerned.

peterdfields commented 8 years ago

Okay, I will proceed with the fc_quiver.py script and see how things go. Thank you for your help, Greg!