droSim1 15Way conservation

atalgarni commented 6 years ago

This is a practical guidance for generating the most conserved elements of drosophila simulans

[ ] Prepare the working directory.
[ ] Prepare the genome data.
[ ] Set the following parameters for generating the pairwise alignment using LASTZ aligner: http://genomewiki.ucsc.edu/index.php/Dm6_27-way_conservation_lastz_parameters
[ ] Do pairwise alignment of the target species {droSim1} with each species {dm3,droYak2,.....,etc}.

To_ generate a pairwise alignment of droSim1.chr2L with droYak2.chr2L I used the following command 👍 lastz /Users/abdulmalek/Documents/DroSim1.phastcons/Genome/droSim1/chr2L.fa /Users/abdulmalek/Documents/DroSim1.phastcons/Genome/droYak2/chr2L.fa M=0 K=2200 L=4000 H=2000 Y=3400 T=1 --scores=DroX.q --format=maf > Sim1.Yak22.maf or another version:

lastz /Users/abdulmalek/Documents/DroSim1.phastcons/Genome/droSim1/chr2L.fa /Users/abdulmalek/Documents/DroSim1.phastcons/Genome/droYak2/chr2L.fa --hspthresh=2200 --gappedthresh=4000 --inner=2000 --ydrop=3400 --seed=12of19 --scores=DroX.q --format=maf > Sim1.Yak2.maf

*_Add a prefix{species name} name to the chromosome no# , for dm3 they all can be: >dm3.chr_**

>for f in *.fa;do
> cat $f | sed -e "s/^>/>dm3./" > $f.new.fa;
> done

[ ] Chaining the pairwise alignment.
[ ] Netting The chained files.
[ ] MAFFing the best chained & Netted files.
[ ] Multiple Alignment : need a phylogenetic tree.
[ ] Run PhastCons ........

atalgarni commented 6 years ago

#!/bin/bash

cd Documents/phastcons/ perl partitionSequence.pl 10010000 10000 /Users/abdulmalek/Documents/phastcons/Genome/droSim1/droSim1.2bit /Users/abdulmalek/Documents/phastcons/Genome/droSim1/droSim1.chrom.sizes -xdir xdir.sh -rawDir ../psl 100 -lstDir tParts > droSim1.lst

export L1=wc -l < droSim1.lst

perl partitionSequence.pl 10010000 10000 /Users/abdulmalek/Documents/phastcons/Genome/droYak2/droYak2.2bit /Users/abdulmalek/Documents/phastcons/Genome/droYak2/droYak2.chrom.sizes -xdir xdir.sh -rawDir ../psl 100 -lstDir qParts > droYak2.lst

export L2 = wc -l < droYak2.lst export L=echo $L1 $L2 | awk '{print $1*$2} echo "cluster batch jobList size: $L = $L1 * $L2"

if [ -d tParts ]; then echo 'constructing tParts/*.2bit files ls tParts/*.lst | sed -e ’s#tParts/##; s#.lst##;' | while read tPart do sed -e 's#.*.2bit:##;' tParts/$tPart.lst \ | twoBitToFa -seqList=stdin /Users/abdulmalek/Documents/phastcons/Genome/droSim1/droSim1.2bit stdout \ | faToTwoBit stdin tParts/$tPart.2bit done fi

if [ -d qParts ]; then echo 'constructing qParts/*.2bit files ls qParts/*.lst | sed -e 's#qParts/##; s#.lst##;' | while read qPart do sed -e 's#.*.2bit:##;' qParts/$qPart.lst \ | twoBitToFa -seqList=stdin /Users/abdulmalek/Documents/phastcons/Genome/droYak2/droYak2.2bit stdout \ | faToTwoBit stdin qParts/$qPart.2bit done fi

atalgarni commented 6 years ago

https://groups.google.com/a/soe.ucsc.edu/forum/#!topic/genome/HQOMmNzdDPc

atalgarni commented 6 years ago

This is how I loop and run lastz on each partitioned 2bit file of target with 2bit files of query

for i in tParts/*.2bit; do for j in qParts/*.2bit; do lastz "$i"[multiple] $j M=0 K=2200 L=4000 H=2000 Y=3400 T=1 --scores=DroX.q --format=maf > MAF/basename $i .2bit-basename $j .2bit.maf; done; done do not forget to add {`} before basname

atalgarni commented 6 years ago

Do not forget the ` before basename

atalgarni commented 6 years ago

After partitioning the genome make sure to copy the original lst file of the genome to the tParts or the qParts. then run this if [ -d tParts ]; then echo 'constructing tParts/.2bit files ls tParts/.lst | sed -e ’s#tParts/##; s#.lst##;' | while read tPart do `sed -e 's#..2bit:##;' tParts/$tPart.lst | twoBitToFa -seqList=stdin /Users/abdulmalek/Documents/phastcons/Genome/droSim1/droSim1.2bit stdout | faToTwoBit stdin tParts/$tPart.2bit done fi` This will output the lst list as .2bit file.

convert the 2bit file to fa file and then faSplit byname to ensure each partitioned chunk in a single file. after that run this command for i in *.fa; do faToTwoBit "$i" ${i%.*}.2bit; done
this command will convert all the single files into 2bit files again to save space and then run the pairwise alignment.

atalgarni commented 6 years ago

Read this thread https://groups.google.com/a/soe.ucsc.edu/forum/m/#!msg/genome/BZjUyNgRKnM/ESMVNVvJDQAJ

atalgarni commented 6 years ago

After I finished chaining and netting the pairwise alignment of D. simulans against D.melanogaster I recheck the total number of chains in order to validate the steps undertaken until this stage. I found a discrepancy output !! after comparing it with genome browser data...

In the following thread I presented the problem to the genome browser USCS team :

Hi there,

I am trying to generate whole genome alignment for droSim1 genome with dm3. I know there is already a chaining file of droSim1.dm3 on UCSC data download, however, I used it as a control output for the pipeline that I adapted following the parameters used to generate the USCS droSim1.dm3.chain file. The problem I am having is that my chaining output file has less chain number than UCSC chaining file. I counted the chain word number like this: grep chain sim1.dm3.chain | wc -l the number counts for generated file 1234692

vs.

grep chain droSim1.dm3.all.chain | wc -l the number counts for UCSC chaining file 1334855

Note: I used chromFaMasked files from both genomes !! which are described as the assembly sequence in one file per chromosome.Repeats are masked by capital Ns; non-repeating sequence is shown in upper case.

For the parameters for the pairwise alignment and chaining : were followed as in http://hgdownload.soe.ucsc.edu/goldenPath/dm3/vsDroSim1/

The difference is 100163 !!! I don't know why is this happening. Is it okay for me to carry on to the next step ?!!!

Best regards, Abdulmalek

USCS Responded:

Hi Abdulmalek,

Thank you for your question about chaining droSim1 and dm3. The reason for the drastic difference in the number of chains is because of the hard-masked sequence used as input. If you instead don't use hard-masked sequence, you should obtain similar numbers of chains.

You still may see small differences due to the blastz/lastz version differences between 2006 and today, and/or if you use a different chunk size. The chunk sizes noted in the README you found are different from the parameters used, the full parameters are found here: https://github.com/ucscGenomeBrowser/kent/blob/master/src/hg/makeDb/doc/dm3.txt#L442

Also note that the droSim1.dm3 chain was made using the chainSwap program on the dm3.droSim1 chain and was not computed directly. Because this chain was not computed directly you may notice other small differences in addition to the chunk sizes issue previously mentioned. The chainSwap program is available here: http://hgdownload.soe.ucsc.edu/admin/exe/

Please let us know if you have any further questions!

Thanks,

Christopher Lee UCSC Genomics Institute

Want to share the Browser with colleagues? Host a workshop: http://bit.ly/ucscTraining

Thank you again for your inquiry and using the UCSC Genome Browser. If you have any further questions, please reply to genome@soe.ucsc.edu. All messages sent to that address are archived on a publicly-accessible forum. If your question includes sensitive data, you may send it instead to genome-www@soe.ucsc.edu.

atalgarni commented 6 years ago

To do list:

Re-Download un-masked genome file for all the organisms.
Re-Run the pair-wise alignment.

atalgarni / atalgarni_PhasCons

droSim1 15Way conservation #1

This is a practical guidance for generating the most conserved elements of drosophila simulans