estebanpw / chromeister

A dotplot generator for large chromosomes
GNU General Public License v3.0
39 stars 4 forks source link

Not finding any synteny in test data #31

Open theo-allnutt-bioinformatics opened 1 year ago

theo-allnutt-bioinformatics commented 1 year ago

Hi,

I was running CHROMEISTER on a plant genome against itself (~1 Gbp) where I expected to see some synteny between chromosomes, but did not see any. I then tested CHROMEISTER on a chloroplast genome (which has two inverted repeat regions) and again saw nothing. I then decided to copy parts of the cp chromosome into test contigs - both forward and reverse complement of ~5kb. This too showed no synteny, just a dot plot with three gaps (fig1 below). What is the minimum size of a block where you would expect to see synteny (I would have thought 5kb would be large enough?).

I then repeated the test with the whole test sequence reverse complemented and as added as another chromosome: CHROMEISTER -dimension 2000 -kmer 16 -query cptest2.fasta -db cptest2.fasta -out cpdotplot.mat && Rscript /g/data/nm31/bin/chromeister/bin/compute_score.R cpdotplot.mat 2000

[INFO] Generating a 1000x1000 matrix [INFO] Loading database 100%...[INFO] Database loaded and of length 374340. [INFO] Ratios: Q [3.743400e+02] D [3.743400e+02]. Lenghts: Q [374340] D [374340] [INFO] Pixel size: Q [2.671368e-03] D [2.671368e-03]. [INFO] Computing absolute hit numbers. 100%...Scanning hits table. 99%... [INFO] Query length 374340. [INFO] Writing matrix. [INFO] Found 0 unique hits for z = 4. 0.996

..and this produced a 'grey' dotplot, fig2.

fig1 fig2

test data:

cptest2.fasta.txt

So something is not working but I am not getting any errors.

Thanks,

Theo

theo-allnutt-bioinformatics commented 1 year ago

And on repeating above, trying different file names, I get no output at all.. no hits.

CHROMEISTER -query cptest2.fasta -db cptest2copy.fasta -out cpdotplot.mat && Rscript /g/data/nm31/bin/chromeister/bin/compute_score.R cpdotplot.mat 1000

[INFO] Generating a 1000x1000 matrix [INFO] Loading database 100%...[INFO] Database loaded and of length 374340. [INFO] Ratios: Q [3.743400e+02] D [3.743400e+02]. Lenghts: Q [374340] D [374340] [INFO] Pixel size: Q [2.671368e-03] D [2.671368e-03]. [INFO] Computing absolute hit numbers. 100%...Scanning hits table. 99%... [INFO] Query length 374340. [INFO] Writing matrix. [INFO] Found 0 unique hits for z = 4. 0.996

theo-allnutt-bioinformatics commented 1 year ago

Sorry, have just noticed, no issues resolved since June 2022. I guess this program is no longer supported.

estebanpw commented 1 year ago

Hello @theo-allnutt-bioinformatics !

Unfortunately, as you mentioned, the program is no longer maintained. However I had some time to look at the problem you are reporting.

It is wierd that no synteny at all is found, but it could be possible. Your execution of the fasta cptest2.fasta shows that 0 unique hits are found and that is why the plot is empty. Could it be that cptest2.fasta was generated artificially?

The problem is that CHROMEISTER is a highly heuristic method that relies on finding small and unique words, which are called unique hits (by default a combination of a fixed 12-mer and its heuristic value). If, for example, a fasta file contained only multiple repeated substrings, then no unique hits could be found.

I did not expect this to be the case, but confirmed it by slightly mutating you cptest2.fasta and then running it against itself. If you want to try, you can use this AWK script to mutate at a rate of 1%:

awk 'BEGIN{val[1]="A"; val[2]="C"; val[3]="G"; val[4]="T"; } /^>/{printf ("%s\n", $0);} /^[^>]/{split($0, arr, ""); for(i=1; i<length(arr); i+=1) if (rand() > 0.01) printf("%c", arr[i]); else printf("%c", val[ int( rand() * 4 + 1)]); printf("\n");}' cptest2.fasta.txt > mutated.fasta

(The script will take all nucleotides and replace approx. 1 in every 100 with a different one)

The new mutated.fasta file has now unique hits since some words, even if they were highly repeated in the original file, now contain a single point mutation that is detected as a unique hit. See:

[INFO] Generating a 1000x1000 matrix
[INFO] Loading database
99%...[INFO] Database loaded and of length 374332.
[INFO] Ratios: Q [3.743320e+02] D [3.743320e+02]. Lenghts: Q [374332] D [374332]
[INFO] Pixel size: Q [2.671425e-03] D [2.671425e-03].
[INFO] Computing absolute hit numbers.
99%...Scanning hits table.
99%...
[INFO] Query length 374332.
[INFO] Writing matrix.
[INFO] Found 2046 unique hits for z = 4.
0.016

Note the 2046 unique hits. When plotting, it looks like the following:

imagen

Where only a diagonal synteny is observed since we are comparing one sequence against itself, and by definition CHROMEISTER removes all possible repeats.

If you suspect that your original plant genomes contain only repeated substrings, then you can try to use the mutation trick (although for a 1 GB file the AWK command can take ~20 min) and then rerun it through CHROMEISTER. But this sounds like a long shot to me - are you completely sure that there should be a large synteny? How many "unique hits" does CHROMEISTER report on them?

I hope this is helpful! Esteban

theo-allnutt-bioinformatics commented 1 year ago

Hi,

thanks for your reply. The test data consists of direct and inverted long repeats (as most chloroplast and mitochondrial genomes do), so the dot plots should show this. I understand your explanation, but unfortunately this means that your program is not usable for me. I think that it may also be a large problem for other users - it is quite frequent to find large, perfectly, or almost perfectly identical synteny in genomes - for examples in artificially generated polyploids or induced mutants. The inability to detect these syntenies is a major issue. I think that you should perhaps point this out at the top of your readme / manual, so that potentially costly mistakes are not made. For example, I had already run chromeister on several genomes and found no synteny. I will now have to repeat that work. It might also be advisable to contact the publishing journal and add an addendum to the paper.

Thanks again for your help,

Theo Allnutt

theo-allnutt-bioinformatics commented 1 year ago

Also:

I think it is important to show unmasked repeats when required - because it is not possible to distinguish these from synteny. The best solution is to allow users to first mask the repeats they are not interested in (e.g. TEs) then look for synteny on the remaining (including repeats within contigs). If you do not make it clear that this program will only work with 'evolved' synteny on different molecules, then others may make the same mistake as I did and assume no synteny was present.

estebanpw commented 1 year ago

Hello @theo-allnutt-bioinformatics !

Thanks for your reply. Could you suggest two large, publicly available chloroplast / mitochondrial genomes (e.g. NCBI)? I would like to try something out that could maybe improve CHROMEISTER, and if it does not, then at last I can add a warning and a specific example.

Thanks Esteban

theo-allnutt-bioinformatics commented 1 year ago

Hi,

I think you may have misunderstood my point. Although I used a chloroplast genome in my example, it could be any genome with a repeat. You could make an artificial molecule (or take any sequence) and directly copy it into a tandem or inverted repeat, and in its current form, CHROMEISTER would not detect it, as shown in my example.

Here is another example. The sequence file is:

MT726210.1 Brassica rapa chloroplast, complete genome NC_028272.1 Brassica juncea chloroplast, complete genome KT581449.1 Brassica juncea chloroplast, complete genome

brassica_chloroplast.fasta.txt

These two species are closely related, so direct synteny should be visible. And obviously the two B. juncea chloroplast genomes should be even more so. We should also be able to see the long inverted repeats, which most chloroplasts have. However, the dot plot is 'grey':

cpdotplot mat filt

I hope this illustrates the problem.

Theo