Open atalgarni opened 6 years ago
#!/bin/bash
cd Documents/phastcons/
perl partitionSequence.pl 10010000 10000
/Users/abdulmalek/Documents/phastcons/Genome/droSim1/droSim1.2bit
/Users/abdulmalek/Documents/phastcons/Genome/droSim1/droSim1.chrom.sizes -xdir xdir.sh -rawDir ../psl 100 -lstDir tParts > droSim1.lst
export L1=wc -l < droSim1.lst
perl partitionSequence.pl 10010000 10000 /Users/abdulmalek/Documents/phastcons/Genome/droYak2/droYak2.2bit /Users/abdulmalek/Documents/phastcons/Genome/droYak2/droYak2.chrom.sizes -xdir xdir.sh -rawDir ../psl 100 -lstDir qParts > droYak2.lst
export L2 = wc -l < droYak2.lst
export L=echo $L1 $L2 | awk '{print $1*$2}
echo "cluster batch jobList size: $L = $L1 * $L2"
if [ -d tParts ]; then
echo 'constructing tParts/*.2bit files
ls tParts/*.lst | sed -e ’s#tParts/##; s#.lst##;' | while read tPart
do
sed -e 's#.*.2bit:##;' tParts/$tPart.lst \
| twoBitToFa -seqList=stdin /Users/abdulmalek/Documents/phastcons/Genome/droSim1/droSim1.2bit stdout \ | faToTwoBit stdin tParts/$tPart.2bit
done
fi
if [ -d qParts ]; then
echo 'constructing qParts/*.2bit files
ls qParts/*.lst | sed -e 's#qParts/##; s#.lst##;' | while read qPart
do
sed -e 's#.*.2bit:##;' qParts/$qPart.lst \
| twoBitToFa -seqList=stdin /Users/abdulmalek/Documents/phastcons/Genome/droYak2/droYak2.2bit stdout \ | faToTwoBit stdin qParts/$qPart.2bit
done
fi
This is how I loop and run lastz on each partitioned 2bit file of target with 2bit files of query
for i in tParts/*.2bit; do for j in qParts/*.2bit; do lastz "$i"[multiple] $j M=0 K=2200 L=4000 H=2000 Y=3400 T=1 --scores=DroX.q --format=maf > MAF/
basename $i .2bit-
basename $j .2bit.maf; done; done
do not forget to add {`} before basname
Do not forget the ` before basename
After partitioning the genome make sure to copy the original lst file of the genome to the tParts or the qParts. then run this
if [ -d tParts ]; then
echo 'constructing tParts/.2bit files
ls tParts/.lst | sed -e ’s#tParts/##; s#.lst##;' | while read tPart
do
`sed -e 's#..2bit:##;' tParts/$tPart.lst
| twoBitToFa -seqList=stdin /Users/abdulmalek/Documents/phastcons/Genome/droSim1/droSim1.2bit stdout | faToTwoBit stdin tParts/$tPart.2bit
done
fi`
This will output the lst list as .2bit file.
for i in *.fa; do faToTwoBit "$i" ${i%.*}.2bit; done
After I finished chaining and netting the pairwise alignment of D. simulans against D.melanogaster I recheck the total number of chains in order to validate the steps undertaken until this stage. I found a discrepancy output !! after comparing it with genome browser data...
In the following thread I presented the problem to the genome browser USCS team :
Hi there,
I am trying to generate whole genome alignment for droSim1 genome with dm3. I know there is already a chaining file of droSim1.dm3 on UCSC data download, however, I used it as a control output for the pipeline that I adapted following the parameters used to generate the USCS droSim1.dm3.chain file. The problem I am having is that my chaining output file has less chain number than UCSC chaining file.
I counted the chain word number like this:
grep chain sim1.dm3.chain | wc -l
the number counts for generated file 1234692
vs.
grep chain droSim1.dm3.all.chain | wc -l
the number counts for UCSC chaining file 1334855
Note: I used chromFaMasked files from both genomes !! which are described as the assembly sequence in one file per chromosome.Repeats are masked by capital Ns; non-repeating sequence is shown in upper case.
For the parameters for the pairwise alignment and chaining : were followed as in http://hgdownload.soe.ucsc.edu/goldenPath/dm3/vsDroSim1/
The difference is 100163 !!! I don't know why is this happening. Is it okay for me to carry on to the next step ?!!!
Best regards, Abdulmalek
USCS Responded:
Hi Abdulmalek,
Thank you for your question about chaining droSim1 and dm3. The reason for the drastic difference in the number of chains is because of the hard-masked sequence used as input. If you instead don't use hard-masked sequence, you should obtain similar numbers of chains.
You still may see small differences due to the blastz/lastz version differences between 2006 and today, and/or if you use a different chunk size. The chunk sizes noted in the README you found are different from the parameters used, the full parameters are found here: https://github.com/ucscGenomeBrowser/kent/blob/master/src/hg/makeDb/doc/dm3.txt#L442
Also note that the droSim1.dm3 chain was made using the chainSwap program on the dm3.droSim1 chain and was not computed directly. Because this chain was not computed directly you may notice other small differences in addition to the chunk sizes issue previously mentioned. The chainSwap program is available here: http://hgdownload.soe.ucsc.edu/admin/exe/
Please let us know if you have any further questions!
Thanks,
Christopher Lee UCSC Genomics Institute
Want to share the Browser with colleagues? Host a workshop: http://bit.ly/ucscTraining
Thank you again for your inquiry and using the UCSC Genome Browser. If you have any further questions, please reply to genome@soe.ucsc.edu. All messages sent to that address are archived on a publicly-accessible forum. If your question includes sensitive data, you may send it instead to genome-www@soe.ucsc.edu.
To do list:
This is a practical guidance for generating the most conserved elements of drosophila simulans
[ ] Prepare the working directory.
[ ] Prepare the genome data.
[ ] Set the following parameters for generating the pairwise alignment using LASTZ aligner: http://genomewiki.ucsc.edu/index.php/Dm6_27-way_conservation_lastz_parameters
[ ] Do pairwise alignment of the target species {droSim1} with each species {dm3,droYak2,.....,etc}.
To_ generate a pairwise alignment of droSim1.chr2L with droYak2.chr2L I used the following command 👍
lastz /Users/abdulmalek/Documents/DroSim1.phastcons/Genome/droSim1/chr2L.fa /Users/abdulmalek/Documents/DroSim1.phastcons/Genome/droYak2/chr2L.fa M=0 K=2200 L=4000 H=2000 Y=3400 T=1 --scores=DroX.q --format=maf > Sim1.Yak22.maf
or another version:lastz /Users/abdulmalek/Documents/DroSim1.phastcons/Genome/droSim1/chr2L.fa /Users/abdulmalek/Documents/DroSim1.phastcons/Genome/droYak2/chr2L.fa --hspthresh=2200 --gappedthresh=4000 --inner=2000 --ydrop=3400 --seed=12of19 --scores=DroX.q --format=maf > Sim1.Yak2.maf
*_Add a prefix{species name} name to the chromosome no# , for dm3 they all can be: >dm3.chr_**
[ ] Chaining the pairwise alignment.
[ ] Netting The chained files.
[ ] MAFFing the best chained & Netted files.
[ ] Multiple Alignment : need a phylogenetic tree.
[ ] Run PhastCons ........