estebanpw / chromeister

A dotplot generator for large chromosomes
GNU General Public License v3.0

parallel analysis #18

Closed by Karimi-81 2 years ago

Karimi-81 commented 3 years ago

Hi there,

I am going to compare two large genome assemblies (each ~2.6 Gb). I would like to use allVsAll.sh for this, but I have two concerns:

1) I did not separate the chromosomes, so my genome directory contains two whole-genome files. Is this a correct way to run such an analysis?
2) I would like to run the analysis in parallel. You mentioned that this is possible by re-issuing the command, but I want to submit the job to a server as a Slurm job, so I wonder how to handle that.

Here is my command: allVsAll.sh /chromeister/genomes/ fasta dim 2000 kmer 32

Thank you for your support.
Karim

estebanpw commented 3 years ago

Hello @Karimi-81

One question: how many FASTA files are there for the two genomes? If there are only two (e.g. genome assembly A and genome assembly B), you can compare them with the usual command:

./CHROMEISTER -query genomeAssemblyA.fasta -db genomeAssemblyB.fasta -out assemblyAB.mat -dimension 2000 && Rscript compute_score.R assemblyAB.mat 2000

That command also works for multi-FASTA files. If there are very many sequences (e.g. thousands and thousands), you can remove the grid, which might otherwise blur the plot, by using compute_score-nogrid.R instead of compute_score.R in the previous command.

Otherwise, if you have LOTS of FASTA files, there are two options: 1) you can easily concatenate them using the cat command (e.g. cat *.fasta >> genomeAssemblyX.fasta), or 2) you can use the script you were referring to by running: allVsAll.sh /chromeister/genomes/ fasta 2000 kmer 32 4 (don't forget the 4 at the end).
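For option 1, note that >> appends, so re-running the one-liner duplicates the sequences, and an output file that already exists in the same directory also matches *.fasta. A minimal sketch of a more defensive concatenation follows. The function name concat_fasta, the temporary directory, and the toy sequences are made up for illustration and are not part of CHROMEISTER.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Concatenate every *.fasta file in a directory into one multi-FASTA file.
# The output path is expected to live OUTSIDE the input directory, so the
# glob can never pick up the output file itself.
concat_fasta() {
    local in_dir="$1" out_file="$2" f
    : > "$out_file"                      # start from an empty output file
    for f in "$in_dir"/*.fasta; do
        [ -e "$f" ] || continue          # the glob matched nothing
        cat "$f" >> "$out_file"
        # If the file did not end with a newline, add one so the next
        # header is not glued to the last sequence line.
        [ -z "$(tail -c 1 "$f")" ] || echo >> "$out_file"
    done
}

# Tiny demonstration with made-up sequences in a temporary directory.
demo_dir="$(mktemp -d)"
mkdir "$demo_dir/in"
printf '>chr1\nACGT\n' > "$demo_dir/in/a.fasta"
printf '>chr2\nTTGA'   > "$demo_dir/in/b.fasta"   # no trailing newline
concat_fasta "$demo_dir/in" "$demo_dir/genomeAssemblyX.fasta"

# Count the sequence headers in the merged file.
grep -c '^>' "$demo_dir/genomeAssemblyX.fasta"
```

The merged file can then be passed as -query or -db in the pairwise command above.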

That will run an all-vs-all comparison and generate a lot of files. When it completes, it will also generate a CSV file with information about each comparison. As to how to run this in parallel, simply re-issue the same command, allVsAll.sh /chromeister/genomes/ fasta 2000 kmer 32 4 (this also works under Slurm). The only important thing is that it must be exactly the same command: the allVsAll.sh script checks whether there are files corresponding to the other executions (I know this is a bit of an awkward parallel implementation, but in the future I will make a map approach :-) ).
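To make the idea of "re-issue the same command" concrete, here is a rough sketch of how file-based coordination between identical workers can work in general. It is an illustration only and is not the code of allVsAll.sh: the function run_pairs, the .claim files, the output naming, and the choice to include self-comparisons and both orders are all assumptions. Each worker walks the same list of pairs and claims a pair by creating a marker file. With noclobber set, the redirection fails if the marker already exists, so only one worker takes each pair.

```shell
#!/usr/bin/env bash
set -euo pipefail
shopt -s nullglob   # an empty genomes directory yields no loop iterations

# Illustration only: NOT the logic of allVsAll.sh.
# Prints how many pairs this worker claimed.
run_pairs() {
    local genomes_dir="$1" out_dir="$2" claimed=0
    local a b out
    for a in "$genomes_dir"/*.fasta; do
        for b in "$genomes_dir"/*.fasta; do
            out="$out_dir/$(basename "$a" .fasta)-$(basename "$b" .fasta).mat"
            # Try to claim the pair; the subshell fails if the claim file
            # already exists, meaning another worker got there first.
            if ( set -o noclobber; : > "$out.claim" ) 2>/dev/null; then
                # A real worker would run the comparison here; this
                # placeholder just records which pair was handled.
                echo "placeholder for $a vs $b" > "$out"
                claimed=$((claimed + 1))
            fi
        done
    done
    echo "$claimed"
}

# Demonstration with two made-up genomes in a temporary directory.
demo="$(mktemp -d)"
mkdir "$demo/genomes" "$demo/out"
printf '>a\nACGT\n' > "$demo/genomes/A.fasta"
printf '>b\nACGT\n' > "$demo/genomes/B.fasta"
first="$(run_pairs "$demo/genomes" "$demo/out")"    # claims every pair
second="$(run_pairs "$demo/genomes" "$demo/out")"   # finds nothing left
echo "first run: $first, second run: $second"
```

Under Slurm, each submitted job would play the role of one such worker, which is why every job has to be given the identical command and see the same directories.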

Important: make sure to update your copy of the repository; I just pushed an update.

Let me know if this was of help, Esteban