Closed dgordon562 closed 8 years ago
git-sym knows how to acquire the data. It's a pretty sweet little tool, actually. Super efficient. It looks for symlinks with a certain naming convention, and then it relies on git-sym.makefile to fetch the actual files and fill in the symlinks.
FALCON-integrate knows how to put git-sym into your path, and they're both part of FALCON-integrate. But technically, you only need git-sym.
What wget command should I use to download synth0? (I need this command whether or not I use git-sym.)
But really, you don't need to run that directly. git-sym really is a nice system. It works marvelously well with git.
Thanks, Chris.
I did successfully download and uncompress 3 files:
file with 10 sequences
file.1 with 50 sequences
file.2 with 1 sequence
file.2 appears to be the answer.
Which of these should I do assemblies with? Just file.1 or file and file.1?
Thanks! David
git plus git-sym should work. Are you unable to run git on your system?
FALCON-integrate:master$ ls -l FALCON-examples/run/synth0/data
lrwxrwxrwx 1 cdunn Domain Users 34 Sep 20 08:36 ref.fasta -> ../../../.git-sym/synth0.ref.fasta
lrwxrwxrwx 1 cdunn Domain Users 41 Sep 20 08:36 synth0.fasta -> ../../../.git-sym/synth0-circ-20.pb.fasta
ref.fasta is a random 5000b genome, assumed circular.
synth0.fasta is 20x coverage of that genome by random-sampling.
Hi, Chris,
git runs fine on our system. I've been reading the git-sym documentation and have tried setting up the links but have been unsuccessful. Couldn't I instead just run wget and gunzip and have it done in a minute or two? If you insist on git-sym, could you give me the commands for getting the synth0 dataset?
Thanks, David
Either of these should work:
cd FALCON-examples/run/synth0/data
git-sym update
Or:
cd FALCON-examples
git-sym update run/synth0/data
Clearly git-sym is very elegant, but I'm under a lot of pressure. I'll send you the problems with it so far, but it is probably difficult for you to debug this remotely. I'm also under a lot of pressure to do other work so I can't take hours to download a little file. I'll give you the error message below. If you can't immediately see the problem, how about if you just answer the question above:
file with 10 sequences
file.1 with 50 sequences
file.2 with 1 sequence
file.2 appears to be the answer.
Which of these should I do assemblies with? Just file.1 or file and file.1?
Here is the git-sym output:
~/falcon/160201/FALCON-integrate/git-sym/git-sym update
-> in dir '.'
<- back to dir '/net/gs/vol1/home/dgordon/falcon/160201/FALCON-integrate/FALCON-examples/run/synth0'
symlink: 'data/ref.fasta'
symlink: 'data/synth0.fasta'
'data/ref.fasta' -> '../../../.git-sym/synth0.ref.fasta' does not exist
'data/synth0.fasta' -> '../../../.git-sym/synth0-circ-20.pb.fasta' does not exist
-> in dir '/net/gs/vol1/home/dgordon/falcon/160201/FALCON-integrate/.git/modules/FALCON-examples/git-sym-local/links'
-> in dir '/net/gs/vol1/home/dgordon/falcon/160201/FALCON-integrate/.git/modules/FALCON-examples/git-sym-local/cache'
make -j -f /net/gs/vol1/home/dgordon/falcon/160201/FALCON-integrate/FALCON-examples/git-sym.makefile 'synth0.ref.fasta'
wget https://www.dropbox.com/s/jz0m0n2a1b19xyd/from.fasta.gz
--2016-02-15 12:58:58-- https://www.dropbox.com/s/jz0m0n2a1b19xyd/from.fasta.gz
Resolving www.dropbox.com... 108.160.172.206, 108.160.172.238
Connecting to www.dropbox.com|108.160.172.206|:443... connected.
HTTP request sent, awaiting response... 302 FOUND
Location: https://dl.dropboxusercontent.com/content_link/4xCoOTCCsLwe8IEVC5iITPlhBkDetiFazxm26CadTb47E1F2d5IY7tZfimfd0dPT/file [following]
--2016-02-15 12:58:58-- https://dl.dropboxusercontent.com/content_link/4xCoOTCCsLwe8IEVC5iITPlhBkDetiFazxm26CadTb47E1F2d5IY7tZfimfd0dPT/file
Resolving dl.dropboxusercontent.com... 199.47.217.69
Connecting to dl.dropboxusercontent.com|199.47.217.69|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1694 (1.7K) [application/octet-stream]
Saving to: “file”
100%[======================================>] 1,694 --.-K/s in 0s
2016-02-15 12:58:59 (284 MB/s) - “file” saved [1694/1694]
gunzip -c from.fasta.gz >| synth0.ref.fasta
gzip: from.fasta.gz: No such file or directory
make: *** [synth0.ref.fasta] Error 1
Traceback (most recent call last):
File "/net/gs/vol1/home/dgordon/falcon/160201/FALCON-integrate/git-sym/git-sym", line 455, in main
cmd_table[cmd](**args)
File "/net/gs/vol1/home/dgordon/falcon/160201/FALCON-integrate/git-sym/git-sym", line 357, in git_sym_update
retrieve(needed)
File "/net/gs/vol1/home/dgordon/falcon/160201/FALCON-integrate/git-sym/git-sym", line 284, in retrieve
retrieve_using_make(makefilename, paths)
File "/net/gs/vol1/home/dgordon/falcon/160201/FALCON-integrate/git-sym/git-sym", line 277, in retrieve_using_make
system(cmd)
File "/net/gs/vol1/home/dgordon/falcon/160201/FALCON-integrate/git-sym/git-sym", line 81, in system
raise Exception('%d <- %r' %(rc, cmd))
Exception: 512 <- "make -j -f /net/gs/vol1/home/dgordon/falcon/160201/FALCON-integrate/FALCON-examples/git-sym.makefile 'synth0.ref.fasta'"
Subsequent attempts give similar errors, except:
gunzip -c circ-20.pb.fasta.gz >| synth0-circ-20.pb.fasta
gzip: circ-20.pb.fasta.gz: No such file or directory
make: *** [synth0-circ-20.pb.fasta] Error 1
I should also show you that the files in cache are zero length:
ll /net/gs/vol1/home/dgordon/falcon/160201/FALCON-integrate/.git/modules/FALCON-examples/git-sym-local/cache
total 128
drwxrwxr-x 4 dgordon eichlerlab 2048 Feb 15 12:55 ..
-rw-rw-r-- 1 dgordon eichlerlab 1694 Feb 15 12:58 file
-rw-rw-r-- 1 dgordon eichlerlab 0 Feb 15 12:58 synth0.ref.fasta
-rw-rw-r-- 1 dgordon eichlerlab 13565 Feb 15 13:02 file.1
-rw-rw-r-- 1 dgordon eichlerlab 0 Feb 15 13:02 synth0-circ-20.pb.fasta
drwxrwxr-x 2 dgordon eichlerlab 2048 Feb 15 13:02 .
I don't know what file/file.1/file.2 are. If you're looking for nice quick test data, you need synth0.ref.fasta and synth0-circ-20.pb.fasta.
You're having a problem with wget -- a 302 from dropbox. curl -L might work better for following the redirect. You can specify the output filename too.
I wish I were allowed to pursue an AWS-based solution for this. Downloading from dropbox can be problematic. But AWS costs money, and I'm not setting that up without a credit card from PacBio.
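For anyone hitting the same snag: in the log above, wget saved the download under the name "file", so from.fasta.gz never existed, gunzip failed, and the `>|` redirect left a zero-length synth0.ref.fasta behind. Below is a minimal sketch of the `curl -L` approach with an explicit output name. `fetch_gz` is a hypothetical helper, not something git-sym provides, and the URL in the usage comment is simply the one from the log, which may no longer be live.

```shell
# Sketch only: download a gzipped file, following redirects, then decompress it.
# fetch_gz is a hypothetical helper name, not part of git-sym.
# Usage: fetch_gz URL DOWNLOADED_GZ OUTPUT_FILE
fetch_gz() {
    # -L follows redirects, -f fails on HTTP errors, -o names the output file explicitly.
    curl -fL -o "$2" "$1" || return 1
    # Decompress to a temporary name first, so a failure does not leave
    # a zero-length target file in place.
    gunzip -c "$2" > "$3.tmp" || { rm -f "$3.tmp"; return 1; }
    mv "$3.tmp" "$3"
}

# Example, using the URL and file names from the log above:
# fetch_gz https://www.dropbox.com/s/jz0m0n2a1b19xyd/from.fasta.gz from.fasta.gz synth0.ref.fasta
```

Writing to a temporary name and renaming at the end is a design choice of this sketch; it avoids the empty cache files shown in the directory listing above.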
Thanks, Chris. I think I have it now. synth0-circ-20.pb.fasta has 50 sequences, each with exactly 2000 bp. Sound right?
What do I need synth0.ref.fasta for? It doesn't go into the assembly, does it? Is it to check the answer?
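As a sanity check on the numbers in this thread: 50 reads of 2000 bp is 100,000 bases, and 100,000 / 5000 = 20, which matches the stated 20x coverage of the 5000-base genome. Here is a small sketch for checking this on a downloaded file. `fasta_stats` and `coverage` are hypothetical helper names, and the genome size of 5000 is taken from the description of ref.fasta earlier in the thread.

```shell
# Sketch only: count sequences and total bases in a FASTA file.
# fasta_stats is a hypothetical helper; it prints "<sequences> <total bases>".
fasta_stats() {
    awk '
        /^>/ { n++; next }                                  # header line: one more sequence
        { gsub(/[ \t\r]/, ""); bases += length($0) }        # sequence line: add its length
        END { print n + 0, bases + 0 }
    ' "$1"
}

# coverage TOTAL_BASES GENOME_SIZE prints the fold coverage (integer division).
coverage() {
    echo $(( $1 / $2 ))
}

# Example:
# set -- $(fasta_stats synth0-circ-20.pb.fasta)
# echo "$1 reads, $2 bases, $(coverage "$2" 5000)x coverage"
```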
There is a script somewhere called check.py which can verify that your draft assembly is a rotation of the reference. But no, you don't need the reference.
I'll switch git-sym.makefile to use curl -L. Thanks for reporting the snags.
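To illustrate what "a rotation of the reference" means for a circular genome, here is a rough sketch. This is not the actual check.py, whose contents are not shown in this thread; `is_rotation` and `seq_of` are hypothetical names. It assumes an exact match on the same strand, so it ignores reverse complements and any base errors in the draft.

```shell
# Sketch only, not the real check.py.
# seq_of prints the sequence of a single-record FASTA file as one line.
seq_of() {
    grep -v '^>' "$1" | tr -d '\n\r'
}

# is_rotation A B succeeds if B is a rotation of A:
# the lengths are equal and B occurs somewhere inside A written twice.
is_rotation() {
    [ "${#1}" -eq "${#2}" ] || return 1
    case "$1$1" in
        *"$2"*) return 0 ;;
        *)      return 1 ;;
    esac
}

# Example (draft.fasta is a placeholder name for your assembled contig):
# is_rotation "$(seq_of ref.fasta)" "$(seq_of draft.fasta)" && echo "rotation of reference"
```

Doubling the first sequence is a standard trick: every rotation of a circular string appears as a substring of the string concatenated with itself.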
Hi, Chris,
Unfortunately, I'm running into the same problem trying to download the arabidopsis data:
I cd to:
cd ~/falcon/160201/FALCON-integrate/FALCON-examples/run/arab/data
I type:
git-sym update
and get:
-> in dir '.'
<- back to dir '/net/gs/vol1/home/dgordon/falcon/160201/FALCON-integrate/FALCON-examples/run/arab/data'
symlink: 'arab-creads.fasta'
'arab-creads.fasta' -> '../../../.git-sym/arab-creads.fasta' does not exist
-> in dir '/net/gs/vol1/home/dgordon/falcon/160201/FALCON-integrate/.git/modules/FALCON-examples/git-sym-local/links'
-> in dir '/net/gs/vol1/home/dgordon/falcon/160201/FALCON-integrate/.git/modules/FALCON-examples/git-sym-local/cache'
make -j -f /net/gs/vol1/home/dgordon/falcon/160201/FALCON-integrate/FALCON-examples/git-sym.makefile 'arab-creads.fasta'
cp -f /lustre/hpcprod/cdunn/data/arab_test/corrected.fasta arab-creads.fasta
cp: cannot stat `/lustre/hpcprod/cdunn/data/arab_test/corrected.fasta': No such file or directory
make: *** [arab-creads.fasta] Error 1
Traceback (most recent call last):
File "/net/gs/vol1/home/dgordon/falcon/160201/FALCON-integrate/git-sym/git-sym", line 455, in main
cmd_table[cmd](**args)
File "/net/gs/vol1/home/dgordon/falcon/160201/FALCON-integrate/git-sym/git-sym", line 357, in git_sym_update
retrieve(needed)
File "/net/gs/vol1/home/dgordon/falcon/160201/FALCON-integrate/git-sym/git-sym", line 284, in retrieve
retrieve_using_make(makefilename, paths)
File "/net/gs/vol1/home/dgordon/falcon/160201/FALCON-integrate/git-sym/git-sym", line 277, in retrieve_using_make
system(cmd)
File "/net/gs/vol1/home/dgordon/falcon/160201/FALCON-integrate/git-sym/git-sym", line 81, in system
raise Exception('%d <- %r' %(rc, cmd))
Exception: 512 <- "make -j -f /net/gs/vol1/home/dgordon/falcon/160201/FALCON-integrate/FALCON-examples/git-sym.makefile 'arab-creads.fasta'"
It appears you've left some of your own full paths here? Could you give me the dropbox address and I can get it from there rather than using git-sym?
Thanks! David
Hi, Chris,
If you look at:
FALCON-integrate/FALCON-examples/git-sym.makefile
you'll see that arabidopsis doesn't have an associated dropbox--you made it just copy from somewhere else on your filesystem.
Or...am I looking in the wrong place?
Thanks! David
If it will take a while to fix the makefile, could I use the following to test Falcon on arabidopsis?
http://datasets.pacb.com.s3.amazonaws.com/2014/Arabidopsis-lyrata/list.html
The lyrata may not be the best one for testing. On this site, the ler-0 or the dmel are better than the lyrata.
Why is that? Does it assemble poorly? (ler-0 doesn't have raw fasta files so there is more work to use it for testing. I want a plant.)
The lyrata was a diploid sample and it was hard to get HMW DNA...
Why does that make it not the best for testing? Will it assemble poorly?
What are you testing for? There is no good reference for Lyrata. The reads are shorter too.
Check this; you can get the data from SRA: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4365909/#S1
To answer the question on Arabidopsis, I pay for my own Dropbox account, so I can't fill it with large data. I hope to have a bigger-data solution in a couple weeks.
Arabidopsis assembly will take a while, so it is not the best for quickly testing the workflow anyway. Most of the data can be downloaded from SRA. If necessary, we can write some automatic download-and-test scripts using SRA rather than Dropbox.
Jason, Aaron asked IT to get me an AWS account, but they are thinking we could use our Internet Colocation Site instead. We should have a solution for this within a couple of weeks.
@pb-cdunn it seems this is not fixed yet. I get the same error message as dgordon562. I want to test Falcon-unzip on the arab data. Could you share a link where I can download these data (Col-0, Cvi-0, and F1)? Thanks.
Hi, Chris,
I'm sure there is something I need to run to populate these directories and I just don't know what it is (make something...)?
In FALCON-integrate/FALCON-examples/run are nice directories like this:
synth0 contains:
but when you look in "data", you just get dead links to a directory that doesn't exist:
It is probably something obvious I haven't done...
Thanks! David
P.S. I'm trying to follow your advice of testing with synth0.
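The "dead links" described above are the git-sym placeholders before the data has been fetched. A small sketch for listing them, assuming only standard `find` and `readlink`; `list_dangling` is a hypothetical helper name, not a git-sym command.

```shell
# Sketch only: print every symlink under a directory whose target does not exist.
# list_dangling is a hypothetical helper name.
list_dangling() {
    find "$1" -type l | while IFS= read -r link; do
        # [ -e ] follows the symlink, so it is false when the target is missing.
        [ -e "$link" ] || printf '%s -> %s\n' "$link" "$(readlink "$link")"
    done
}

# Example, from the FALCON-examples directory:
# list_dangling run/synth0/data
# git-sym update run/synth0/data    # command quoted from this thread
# list_dangling run/synth0/data     # should print nothing once the data is fetched
```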