Rfam / rfam-website

Rfam website source code
https://rfam.org
Apache License 2.0

Misleading labels in species sunburst #7

Open AntonPetrov opened 7 years ago

AntonPetrov commented 7 years ago

Example

Species sunburst for Clostridia in RF01315 shows that there are 64 sequences:

[Screenshot of the species sunburst, taken 2016-02-19]

An example SQL query confirming the number of sequences:

SELECT CONCAT(t1.rfamseq_acc, '/', t1.seq_start, '-', t1.seq_end)
FROM full_region t1, rfamseq t2, taxonomy t3
WHERE t1.rfam_acc = 'RF01315'
AND t1.rfamseq_acc = t2.rfamseq_acc
AND t2.ncbi_id = t3.ncbi_id
AND t3.tax_string LIKE '%Clostridia;%'
AND t1.is_significant = 1
GROUP BY t1.rfamseq_acc;

This returns 64 rows, matching the sunburst UI. Note the GROUP BY clause, which collapses all regions on the same sequence accession into a single row.

However, there are many more annotated regions:

SELECT CONCAT(t1.rfamseq_acc, '/', t1.seq_start, '-', t1.seq_end)
FROM full_region t1, rfamseq t2, taxonomy t3
WHERE t1.rfam_acc = 'RF01315'
AND t1.rfamseq_acc = t2.rfamseq_acc
AND t2.ncbi_id = t3.ncbi_id
AND t3.tax_string LIKE '%Clostridia;%'
AND t1.is_significant = 1;

This returns 6222 rows: without the GROUP BY clause, every annotated region is listed separately.

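So the number of entries in the resulting FASTA file is inconsistent with the count shown in the sunburst UI: the label appears to count distinct sequence accessions, while the file has one entry per annotated region.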
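The difference between the two counts can be reproduced on toy data. The sketch below is only an illustration: it uses an in-memory SQLite database instead of the Rfam MySQL database, a cut-down `full_region` table, and made-up accessions (`SEQ_A.1`, `SEQ_B.1`), and it leaves out the joins to `rfamseq` and `taxonomy`. It shows how one family can have far more significant regions than distinct sequences.

```python
import sqlite3

# Toy stand-in for the Rfam database (assumption: simplified schema,
# hypothetical accessions; the real tables have more columns).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE full_region (
    rfam_acc       TEXT,
    rfamseq_acc    TEXT,
    seq_start      INTEGER,
    seq_end        INTEGER,
    is_significant INTEGER
);
INSERT INTO full_region VALUES
    ('RF01315', 'SEQ_A.1',  10,  90, 1),
    ('RF01315', 'SEQ_A.1', 200, 280, 1),
    ('RF01315', 'SEQ_A.1', 500, 580, 1),
    ('RF01315', 'SEQ_B.1',  40, 120, 1);
""")

where = "WHERE rfam_acc = 'RF01315' AND is_significant = 1"

# One row per annotated region (what ends up in a per-region FASTA file).
n_regions = conn.execute(
    f"SELECT COUNT(*) FROM full_region {where}"
).fetchone()[0]

# One row per sequence accession (what GROUP BY rfamseq_acc effectively counts).
n_sequences = conn.execute(
    f"SELECT COUNT(DISTINCT rfamseq_acc) FROM full_region {where}"
).fetchone()[0]

print(f"regions: {n_regions}, sequences: {n_sequences}")
```

If the sunburst is meant to describe the downloadable FASTA file, one option might be to label it with the region count, or to show both numbers ("N regions in M sequences").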