blobtoolkit / blobtoolkit

Interactive quality assessment of genome assemblies
http://blobtoolkit.genomehubs.org
MIT License
85 stars 11 forks source link

Request for AND not OR filtering #147

Open ChristophePatterson opened 1 year ago

ChristophePatterson commented 1 year ago

Hi,

I'd like to request for the inclusion of "AND" filtering within the filter function of blobtools. The current set up can only accommodate "OR". If this feature is already included my apologies, but would it be possible to document it within the tutorial found here: https://blobtoolkit.genomehubs.org/blobtools2/blobtools2-tutorials/filtering-a-dataset/.

Currently, I can only filter data by OR. I would like to filter all contigs within my assembly that have a greater than 90% match to my mitogenome AND have a length shorter than the mitogenome. The following code would suggest that it is filtering by AND but is actually filtering by OR (This code removes all contigs a greater than 80% match to my mitogenome OR have a length shorter than the mitogenome.)

blobtools add \
    --text blastn_mitogenome_hap1.out \
    --text-cols 5=identifiers,6=percent_mitomatch \
    $Hap1_dir-uncont

blobtools filter \
    --query-string "percent_mitomatch--Max=90.0&length--Min=17701" \
    --output $Hap1_dir-uncont-mito \
    --fasta $Hap1_fasta.uncont.filtered.fa \
    $Hap1_dir-uncont

I have gotten around this by writing my own R script, that takes the blastn_mitogenome_hap1.out file (that contains the slen output) and creating a new .out file that contains a TRUE or FALSE statement for my query. This can then be filtered out using blobtools. If anyone else is interested in this code I have attached it below.

The inclusion of AND not OR is required for the filtering out of mitogenomes from the assembly but from my understanding, this is not currently possible within blobtools.

Many thanks,

Christophe

#######################

Custom R script

# Custom R code to create an file that states which contigs 
# Have a greater than 90 match to the mitogenome AND 
# are shorter than the mitogenome

# read in mitogenome blast results
blast_data_hap1 <- read.table("blastn_mitogenome_hap1.out", header = F)
blast_data_hap2 <- read.table("blastn_mitogenome_hap2.out", header = F)
# Read in mitogenome length
mitolength <- read.table("mitogenome_length.txt")

blast_data_hap1$V16 <- blast_data_hap1[,1]<=mitolength[2,]&blast_data_hap1[,7]>=90
blast_data_hap2$V16 <- blast_data_hap2[,1]<=mitolength[2,]&blast_data_hap2[,7]>=90

write.table(blast_data_hap1, "blastn_mitogenome_hap1_AND.out", row.names = F, col.names = F)
write.table(blast_data_hap2, "blastn_mitogenome_hap2_AND.out", row.names = F, col.names = F)

Then inputting into blobtools

blobtools add \
    --text blastn_mitogenome_hap2_AND.out \
    --text-cols 6=identifiers,16=percent_length_mitomatch \
    $Hap2_dir-uncont

echo "Filtering out potential mitogenome contigs"

blobtools filter \
    --param percent_length_mitomatch--Keys=FALSE,NA \
    --invert \
    --output $Hap1_dir-uncont-mito \
    --fasta $Hap1_fasta.uncont.filtered.fa \
    $Hap1_dir-uncont
rjchallis commented 1 year ago

Hi Christophe

Thanks for the suggestion and for a workaround. Somehow no-one asked for and AND filter until now. I agree it would be a useful feature to have so I'll give this some thought.

There is a workaround in the Viewer involving applying the opposite filters, selecting all records, removing the filters then inverting the selection - not very elegant!

There should also be a workaround on the command line, but the --list option to blobtools filters doesn't appear to be working to use that way.

I should be able to get a working AND filter for the command line relatively easily. I can't immediately think of a good way to include it in the Viewer as treating filters as OR goes pretty deep into the design, but hopefully it just needs a little more thought to find a solution.

Rich