PacificBiosciences / FALCON

FALCON: experimental PacBio diploid assembler -- Out-of-date -- Please use a binary release: https://github.com/PacificBiosciences/FALCON_unzip/wiki/Binaries

where is test data (e.g., synth, ecoli, etc.)? #278

Closed dgordon562 closed 8 years ago

dgordon562 commented 8 years ago

Hi, Chris,

I'm sure there is something I need to run to populate these directories and I just don't know what it is (make something...)?

In FALCON-integrate/FALCON-examples/run are nice directories like this:

arab  ecoli  ecoli2  lambda  lambda-hgap  lambda-smb14  synth0

synth0 contains:

-rwxrwxr-x 1 dgordon eichlerlab 1045 Feb  1 12:07 check.py
drwxrwxr-x 2 dgordon eichlerlab 2048 Feb  1 12:07 data
-rw-rw-r-- 1 dgordon eichlerlab 1173 Feb  1 12:07 fc_run.cfg
-rw-rw-r-- 1 dgordon eichlerlab   18 Feb  1 12:07 input.fofn
-rw-rw-r-- 1 dgordon eichlerlab  714 Feb  1 12:07 logging.ini
drwxrwxr-x 3 dgordon eichlerlab 2048 Feb  1 12:07 .
-rw-rw-r-- 1 dgordon eichlerlab  300 Feb  1 12:07 makefile

but when you look in "data", you just get dead links to a directory that doesn't exist:

lrwxrwxrwx 1 dgordon eichlerlab   34 Feb  1 12:07 ref.fasta -> ../../../.git-sym/synth0.ref.fasta
lrwxrwxrwx 1 dgordon eichlerlab   41 Feb  1 12:07 synth0.fasta -> ../../../.git-sym/synth0-circ-20.pb.fasta

It is probably something obvious I haven't done...

Thanks! David

P.S. I'm trying to follow your advice of testing with synth0.

pb-cdunn commented 8 years ago

git-sym knows how to acquire the data. It's a pretty sweet little tool, actually. Super efficient. It looks for symlinks with a certain naming convention, and then it relies on git-sym.makefile to fetch the actual files and fill-in the symlinks.

FALCON-integrate knows how to put git-sym into your path, and they're both part of FALCON-integrate. But technically, you only need git-sym.
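As a rough illustration of the first half of that description (spotting symlinks whose targets are missing, like the ones under synth0/data), here is a minimal Python sketch. It is not git-sym's own code: the function name `find_dangling_symlinks` is hypothetical, and the sketch knows nothing about git-sym's naming convention or about git-sym.makefile. It only reports broken links.

```python
import os


def find_dangling_symlinks(root):
    """Return sorted (link_path, target) pairs for symlinks under root whose target is missing."""
    dangling = []
    for dirpath, dirnames, filenames in os.walk(root):
        # A broken symlink is listed among the file names; a link to a real
        # directory is listed among the directory names. Check both.
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            # os.path.exists() follows the link, so it is False for a broken one.
            if os.path.islink(path) and not os.path.exists(path):
                dangling.append((path, os.readlink(path)))
    return sorted(dangling)
```

Calling `find_dangling_symlinks("FALCON-examples/run/synth0")` on a checkout in the state described above would be expected to list the two links under data, assuming their targets have not been fetched yet.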

dgordon562 commented 8 years ago

what wget command should I use to download synth0? (I need this command whether I use git-sym or don't use it.)

pb-cdunn commented 8 years ago

But really, you don't need to run that directly. git-sym really is a nice system. It works marvelously well with git.

dgordon562 commented 8 years ago

Thanks, Chris.

I did successfully download and uncompress 3 files:

file with 10 sequences
file.1 with 50 sequences
file.2 with 1 sequence

file.2 appears to be the answer.

Which of these should I do assemblies with? Just file.1 or file and file.1?

Thanks! David

pb-cdunn commented 8 years ago

git plus git-sym should work. Are you unable to run git on your system?

FALCON-integrate:master$ ls -l FALCON-examples/run/synth0/data
lrwxrwxrwx 1 cdunn Domain Users     34 Sep 20 08:36 ref.fasta -> ../../../.git-sym/synth0.ref.fasta
lrwxrwxrwx 1 cdunn Domain Users     41 Sep 20 08:36 synth0.fasta -> ../../../.git-sym/synth0-circ-20.pb.fasta

dgordon562 commented 8 years ago

Hi, Chris,

git runs fine on our system. I've been reading the git-sym documentation and have tried setting up the links, but have been unsuccessful. Couldn't I instead just run wget and gunzip and have it done in a minute or two? If you insist on git-sym, could you give me the commands for getting the synth0 dataset?

Thanks, David

pb-cdunn commented 8 years ago

Either of these should work:

cd FALCON-examples/run/synth0/data
git-sym update

Or:

cd FALCON-examples
git-sym update run/synth0/data

dgordon562 commented 8 years ago

Clearly git-sym is very elegant, but I'm under a lot of pressure. I'll send you the problems with it so far, but it is probably difficult for you to debug this remotely. I'm also under a lot of pressure to do other work so I can't take hours to download a little file. I'll give you the error message below. If you can't immediately see the problem, how about if you just answer the question above:

file with 10 sequences
file.1 with 50 sequences
file.2 with 1 sequence

file.2 appears to be the answer.

Which of these should I do assemblies with? Just file.1 or file and file.1?

Here is the git-sym output:

~/falcon/160201/FALCON-integrate/git-sym/git-sym update
-> in dir '.'
<- back to dir '/net/gs/vol1/home/dgordon/falcon/160201/FALCON-integrate/FALCON-examples/run/synth0'
symlink: 'data/ref.fasta'
symlink: 'data/synth0.fasta'
'data/ref.fasta' -> '../../../.git-sym/synth0.ref.fasta' does not exist
'data/synth0.fasta' -> '../../../.git-sym/synth0-circ-20.pb.fasta' does not exist
-> in dir '/net/gs/vol1/home/dgordon/falcon/160201/FALCON-integrate/.git/modules/FALCON-examples/git-sym-local/links'
-> in dir '/net/gs/vol1/home/dgordon/falcon/160201/FALCON-integrate/.git/modules/FALCON-examples/git-sym-local/cache'
make -j -f /net/gs/vol1/home/dgordon/falcon/160201/FALCON-integrate/FALCON-examples/git-sym.makefile 'synth0.ref.fasta'
wget https://www.dropbox.com/s/jz0m0n2a1b19xyd/from.fasta.gz
--2016-02-15 12:58:58-- https://www.dropbox.com/s/jz0m0n2a1b19xyd/from.fasta.gz
Resolving www.dropbox.com... 108.160.172.206, 108.160.172.238
Connecting to www.dropbox.com|108.160.172.206|:443... connected.
HTTP request sent, awaiting response... 302 FOUND
Location: https://dl.dropboxusercontent.com/content_link/4xCoOTCCsLwe8IEVC5iITPlhBkDetiFazxm26CadTb47E1F2d5IY7tZfimfd0dPT/file [following]
--2016-02-15 12:58:58-- https://dl.dropboxusercontent.com/content_link/4xCoOTCCsLwe8IEVC5iITPlhBkDetiFazxm26CadTb47E1F2d5IY7tZfimfd0dPT/file
Resolving dl.dropboxusercontent.com... 199.47.217.69
Connecting to dl.dropboxusercontent.com|199.47.217.69|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1694 (1.7K) [application/octet-stream]
Saving to: “file”

100%[======================================>] 1,694 --.-K/s in 0s

2016-02-15 12:58:59 (284 MB/s) - “file” saved [1694/1694]

gunzip -c from.fasta.gz >| synth0.ref.fasta
gzip: from.fasta.gz: No such file or directory
make: *** [synth0.ref.fasta] Error 1
Traceback (most recent call last):
  File "/net/gs/vol1/home/dgordon/falcon/160201/FALCON-integrate/git-sym/git-sym", line 455, in main
    cmd_table[cmd](**args)
  File "/net/gs/vol1/home/dgordon/falcon/160201/FALCON-integrate/git-sym/git-sym", line 357, in git_sym_update
    retrieve(needed)
  File "/net/gs/vol1/home/dgordon/falcon/160201/FALCON-integrate/git-sym/git-sym", line 284, in retrieve
    retrieve_using_make(makefilename, paths)
  File "/net/gs/vol1/home/dgordon/falcon/160201/FALCON-integrate/git-sym/git-sym", line 277, in retrieve_using_make
    system(cmd)
  File "/net/gs/vol1/home/dgordon/falcon/160201/FALCON-integrate/git-sym/git-sym", line 81, in system
    raise Exception('%d <- %r' %(rc, cmd))
Exception: 512 <- "make -j -f /net/gs/vol1/home/dgordon/falcon/160201/FALCON-integrate/FALCON-examples/git-sym.makefile 'synth0.ref.fasta'"

subsequent attempts give similar errors except:

gunzip -c circ-20.pb.fasta.gz >| synth0-circ-20.pb.fasta
gzip: circ-20.pb.fasta.gz: No such file or directory
make: *** [synth0-circ-20.pb.fasta] Error 1

I should also show you that the files in cache are zero length:

ll /net/gs/vol1/home/dgordon/falcon/160201/FALCON-integrate/.git/modules/FALCON-examples/git-sym-local/cache
total 128
drwxrwxr-x 4 dgordon eichlerlab 2048 Feb 15 12:55 ..
-rw-rw-r-- 1 dgordon eichlerlab 1694 Feb 15 12:58 file
-rw-rw-r-- 1 dgordon eichlerlab 0 Feb 15 12:58 synth0.ref.fasta
-rw-rw-r-- 1 dgordon eichlerlab 13565 Feb 15 13:02 file.1
-rw-rw-r-- 1 dgordon eichlerlab 0 Feb 15 13:02 synth0-circ-20.pb.fasta
drwxrwxr-x 2 dgordon eichlerlab 2048 Feb 15 13:02 .

pb-cdunn commented 8 years ago

I don't know what file/file.1/file.2 are. If you're looking for nice quick test data, you need synth0.ref.fasta and synth0-circ-20.pb.fasta.

You're having a problem with wget -- a 302 from dropbox. curl -L might work better for following the redirect. You can specify the output filename too.
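The log above shows the download being saved under the name "file" (taken from the redirected URL) while the next step ran gunzip on from.fasta.gz, so naming the output explicitly is the important part. With curl, that would be the `-o` option alongside `-L`. The same idea as a minimal Python sketch, assuming nothing beyond the standard library; the function name `fetch_and_gunzip` is hypothetical and this is not what git-sym.makefile does:

```python
import gzip
import shutil
import urllib.request


def fetch_and_gunzip(url, gz_path, out_path):
    """Download url to gz_path, then decompress it to out_path."""
    # urlopen follows HTTP redirects by default. The local file name is
    # chosen by the caller instead of being derived from the final URL.
    with urllib.request.urlopen(url) as response, open(gz_path, "wb") as gz_file:
        shutil.copyfileobj(response, gz_file)
    # Decompress only after the download has been written in full.
    with gzip.open(gz_path, "rb") as src, open(out_path, "wb") as dst:
        shutil.copyfileobj(src, dst)
    return out_path
```

For the reference file in the log, a call might look like `fetch_and_gunzip("https://www.dropbox.com/s/jz0m0n2a1b19xyd/from.fasta.gz", "from.fasta.gz", "synth0.ref.fasta")`. Whether that URL still serves the file is not something this thread establishes.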

I wish I were allowed to pursue an AWS-based solution for this. Downloading from dropbox can be problematic. But AWS costs money, and I'm not setting that up without a credit card from PacBio.

dgordon562 commented 8 years ago

Thanks, Chris. I think I have it now. synth0-circ-20.pb.fasta has 50 sequences, each with exactly 2000 bp. Sound right?

What do I need synth0.ref.fasta for? It doesn't go into the assembly, does it? Is it to check the answer?
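A quick way to confirm counts like these (50 sequences of 2000 bp each) is to tally records and lengths directly. The sketch below is a minimal pure-Python reader, assuming an uncompressed FASTA file; the function name `fasta_lengths` is hypothetical and not part of FALCON.

```python
def fasta_lengths(path):
    """Return a list of (header, length) pairs, one per record in a FASTA file."""
    records = []
    name, length = None, 0
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                # A new header closes the previous record, if there was one.
                if name is not None:
                    records.append((name, length))
                name, length = line[1:], 0
            elif name is not None:
                # Sequence may be wrapped over several lines; sum them.
                length += len(line)
    if name is not None:
        records.append((name, length))
    return records
```

With the file in place, `len(fasta_lengths("synth0-circ-20.pb.fasta"))` gives the number of sequences, and `{n for _, n in fasta_lengths("synth0-circ-20.pb.fasta")}` gives the set of distinct lengths.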

pb-cdunn commented 8 years ago

There is a script somewhere called check.py which can verify that your draft assembly is a rotation of the reference. But no, you don't need the reference.
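For readers wondering what "a rotation of the reference" means in practice, here is a minimal sketch of one way to test it. This is not the contents of check.py, which the thread does not show. The function names are hypothetical, the comparison is exact (no tolerance for mismatches), and checking the reverse strand is an added assumption.

```python
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string (A, C, G, T, N only)."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def is_rotation(draft, reference):
    """True if draft equals reference after some circular shift (same strand)."""
    draft, reference = draft.upper(), reference.upper()
    # Every rotation of a circular sequence appears as a substring of the
    # sequence concatenated with itself.
    return len(draft) == len(reference) and draft in reference + reference


def is_rotation_either_strand(draft, reference):
    """Like is_rotation, but also accepts a draft assembled from the opposite strand."""
    return is_rotation(draft, reference) or is_rotation(reverse_complement(draft), reference)
```

For example, `is_rotation("CGTA", "ACGT")` is true because "CGTA" occurs in "ACGTACGT", while `is_rotation("ACGG", "ACGT")` is false.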

I'll switch git-sym.makefile to use curl -L. Thanks for reporting the snags.

dgordon562 commented 8 years ago

Hi, Chris,

Unfortunately, I'm running into the same problem trying to download the arabidopsis data:

I cd to:

cd ~/falcon/160201/FALCON-integrate/FALCON-examples/run/arab/data

I type:

git-sym update

and get:

-> in dir '.'
<- back to dir '/net/gs/vol1/home/dgordon/falcon/160201/FALCON-integrate/FALCON-examples/run/arab/data'
symlink: 'arab-creads.fasta'
'arab-creads.fasta' -> '../../../.git-sym/arab-creads.fasta' does not exist
-> in dir '/net/gs/vol1/home/dgordon/falcon/160201/FALCON-integrate/.git/modules/FALCON-examples/git-sym-local/links'
-> in dir '/net/gs/vol1/home/dgordon/falcon/160201/FALCON-integrate/.git/modules/FALCON-examples/git-sym-local/cache'
make -j  -f /net/gs/vol1/home/dgordon/falcon/160201/FALCON-integrate/FALCON-examples/git-sym.makefile 'arab-creads.fasta'
cp -f /lustre/hpcprod/cdunn/data/arab_test/corrected.fasta arab-creads.fasta
cp: cannot stat `/lustre/hpcprod/cdunn/data/arab_test/corrected.fasta': No such file or directory
make: *** [arab-creads.fasta] Error 1
Traceback (most recent call last):
  File "/net/gs/vol1/home/dgordon/falcon/160201/FALCON-integrate/git-sym/git-sym", line 455, in main
    cmd_table[cmd](**args)
  File "/net/gs/vol1/home/dgordon/falcon/160201/FALCON-integrate/git-sym/git-sym", line 357, in git_sym_update
    retrieve(needed)
  File "/net/gs/vol1/home/dgordon/falcon/160201/FALCON-integrate/git-sym/git-sym", line 284, in retrieve
    retrieve_using_make(makefilename, paths)
  File "/net/gs/vol1/home/dgordon/falcon/160201/FALCON-integrate/git-sym/git-sym", line 277, in retrieve_using_make
    system(cmd)
  File "/net/gs/vol1/home/dgordon/falcon/160201/FALCON-integrate/git-sym/git-sym", line 81, in system
    raise Exception('%d <- %r' %(rc, cmd))
Exception: 512 <- "make -j  -f /net/gs/vol1/home/dgordon/falcon/160201/FALCON-integrate/FALCON-examples/git-sym.makefile 'arab-creads.fasta'"

It appears you've left some of your own full paths here? Could you give me the dropbox address and I can get it from there rather than using git-sym?

Thanks! David

dgordon562 commented 8 years ago

Hi, Chris,

If you look at:

FALCON-integrate/FALCON-examples/git-sym.makefile

you'll see that arabidopsis doesn't have an associated dropbox--you made it just copy from somewhere else on your filesystem.

Or...am I looking in the wrong place?

Thanks! David

dgordon562 commented 8 years ago

If it will take a while to fix the makefile, how about if I use the following to test Falcon on arabidopsis:

http://datasets.pacb.com.s3.amazonaws.com/2014/Arabidopsis-lyrata/list.html

?

pb-jchin commented 8 years ago

The lyrata data set may not be the best one for testing. On this site, the ler-0 or the dmel data sets are better than the lyrata one.

dgordon562 commented 8 years ago

Why is that? Does it assemble poorly? (ler-0 doesn't have raw fasta files so there is more work to use it for testing. I want a plant.)

pb-jchin commented 8 years ago

The lyrata was a diploid sample and it was hard to get HMW DNA...

dgordon562 commented 8 years ago

Why does that make it not the best for testing? Will it assemble poorly?

pb-jchin commented 8 years ago

What are you testing for? There is no good reference for lyrata. The reads are shorter too.

pb-jchin commented 8 years ago

Check this; you can get the data from SRA: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4365909/#S1

pb-cdunn commented 8 years ago

To answer the question on Arabidopsis, I pay for my own Dropbox account, so I can't fill it with large data. I hope to have a bigger-data solution in a couple weeks.

pb-jchin commented 8 years ago

An Arabidopsis assembly will take a while, so it is not the best choice for quickly testing the workflow anyway. Most of the data can be downloaded from SRA. If necessary, we can write scripts that automatically download and test using SRA rather than Dropbox.

pb-cdunn commented 8 years ago

Jason, Aaron asked IT to get me an AWS account, but they are thinking we could use our Internet Colocation Site instead. We should have a solution for this within a couple of weeks.

zokie commented 7 years ago

@pb-cdunn it seems this has not been fixed yet. I get the same error message as dgordon562. I want to test Falcon-unzip on the arab data. Could you share a link where I can download these data (Col-0, Cvi-0, and F1)? Thanks