GeneDx / pgr-tk

PGR-TK: Pangenome Research Tool Kit
MIT License

pgr-query returns no hits for known sequence #23

Closed: mozack closed this issue 1 year ago

mozack commented 1 year ago

Hi,

I am attempting to use pgr-query to search the human pangenome draft. Known exact hits are not being found. For example:

pgr-query pangenome_draft1_v1.1 test1.fa test1_output

returns no hits and an empty fasta:

cat test1_output.hit
#out_seq_name   ctg_bgn ctg_end color   q_name  orientation     idx     q_idx   query_bgn       query_end       q_len  aln_anchor_count

The input sequence is definitely present:

cat test1.fa
>test1
CAGAATGGACCTTCTCCACCAGGAGAGGCTTCCAAGTGACTTGGACGGCATGCTCACTGAGCCCTTGGACTGTGACATGG

Querying a contig with agc returns the sequence:

agc getctg pangenome_draft1_v1.1.agc HG002#2#JAHKSD010000001.1@HG002.maternal.f1_assembly_v2_genbank.fa | grep -n CAGAATGGACCTTCTCCACCAGGAGAGGCTTCCAAGTGACTTGGACGGCATGCTCACTGAGCCCTTGGACTGTGACATGG
3:CAGAATGGACCTTCTCCACCAGGAGAGGCTTCCAAGTGACTTGGACGGCATGCTCACTGAGCCCTTGGACTGTGACATGG

What might cause this? Is there a minimum query sequence length?

cschin commented 1 year ago

PGR-TK is not designed for searching with short sequences. As the algorithm section explains, the query needs to be long enough to generate minimizer anchors. Please take a look at the provided examples. You can widen the window of sequence that you use as the query to get results.
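To illustrate why a short query can yield no anchors, here is a minimal sketch of plain window minimizers in Python. It is not PGR-TK's implementation (PGR-TK builds on a sparse, hierarchical variant of this idea), and the window size `w = 80` and k-mer size `k = 56` are assumptions chosen for illustration, not confirmed settings of the pangenome_draft1_v1.1 index. The hash function and the randomly generated long sequence are likewise hypothetical stand-ins.

```python
import hashlib
import random


def kmer_hash(kmer: str) -> int:
    """Deterministic 64-bit hash of a k-mer (an arbitrary illustrative choice)."""
    return int.from_bytes(hashlib.blake2b(kmer.encode(), digest_size=8).digest(), "big")


def minimizers(seq: str, w: int, k: int) -> list[tuple[int, str]]:
    """Return (position, k-mer) of the minimum-hash k-mer in every window
    of w consecutive k-mers, with consecutive repeats collapsed."""
    n_kmers = len(seq) - k + 1
    if n_kmers < w:
        # Not even one full window fits, so no minimizer can be selected.
        return []
    hashes = [kmer_hash(seq[i:i + k]) for i in range(n_kmers)]
    picked: list[int] = []
    for start in range(n_kmers - w + 1):
        # Ties are broken by taking the leftmost k-mer in the window.
        best = min(range(start, start + w), key=lambda i: (hashes[i], i))
        if not picked or picked[-1] != best:
            picked.append(best)
    return [(i, seq[i:i + k]) for i in picked]


# The 80 bp query from this issue.
query = "CAGAATGGACCTTCTCCACCAGGAGAGGCTTCCAAGTGACTTGGACGGCATGCTCACTGAGCCCTTGGACTGTGACATGG"

# Assumed, illustrative parameters (not confirmed for the real index).
W, K = 80, 56

# A random 2,000 bp stand-in for a longer query window (not real genomic sequence).
rng = random.Random(0)
long_query = "".join(rng.choice("ACGT") for _ in range(2000))

short_anchors = minimizers(query, W, K)
long_anchors = minimizers(long_query, W, K)

print(len(query), len(short_anchors))  # 80 bp has only 25 k-mers, fewer than one window
print(len(long_query), len(long_anchors))
```

Under these assumed parameters the 80 bp query contains only 25 k-mers, fewer than a single window of 80, so no minimizer is selected and there is nothing to anchor an alignment on. This matches the empty .hit file above, whose last column is aln_anchor_count. Extending the query with flanking sequence from the contig gives the index enough k-mers to work with.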

mozack commented 1 year ago

Ok, thanks.