EvolBioInf / fur

Find Unique genomic Regions
29 stars 3 forks source link

How to select the neighbors genomes? #13

Open wangzhichao1990 opened 11 months ago

wangzhichao1990 commented 11 months ago

Hi,

Thanks for developing this software.

I would like to ask how to select the neighbors genomes. Sometimes there are no sequences in the results of fur.

Thanks.

haubold commented 11 months ago

We are developing the package neighbors for that. The Tutorial in its documentation might help you complete a given set of targets and to find the corresponding neighbors.

wangzhichao1990 commented 11 months ago

We are developing the package neighbors for that. The Tutorial in its documentation might help you complete a given set of targets and to find the corresponding neighbors.

Thank you for your reply. Is this package currently available? I basically choose other genomes of the same genus as the neighbors, maybe I need to change the - q parameter.

haubold commented 11 months ago

Yes, that package is available. A common problem with fur is mixture between target and neighbor genomes. To make sure this doesn't happen, one should reconstruct a phylogeny from the genomes. If you'd like to see how this is done, take a look at the tutorial in the neighbors package.

wangzhichao1990 commented 10 months ago

Yes, that package is available. A common problem with fur is mixture between target and neighbor genomes. To make sure this doesn't happen, one should reconstruct a phylogeny from the genomes. If you'd like to see how this is done, take a look at the tutorial in the neighbors package.

Is there a way to automatically correct the target and neighbor genomes? manual inspection is a time-consuming step and I have many species to construct these genomes. Looking forward to your reply. Thank you.

haubold commented 10 months ago

There's a semi-automatic way, which scales well to hundreds of bacterial genomes. It is described in the tutorial of the neighbors package. It consists of the following steps:

  1. Download target and neighbor genomes as listed by neighbors
  2. Construct tree from genomes
  3. Label tree nodes with and inspect tree to pick the target clade
  4. List the leaves in the target clade and automatically create symbolic links for them to the targets directory
  5. Repeat for the neighbors
  6. Run makeFurDb followed by fur

If this sounds promising to you, take a look at the neighbors tutorial for the details.

wangzhichao1990 commented 10 months ago

Thank you for your reply. I have read the documentation for this package. Actually, I want to automate step 3. At present, I don't have a good idea.

haubold commented 10 months ago

It is a good question, here is a sketch of an answer: Let's say the leaves in the tree are labelled with prefix 'n' for taxonomic neighbors and prefix 't' for taxonomic targets, as explained in the neighbors tutorial. We are looking for the node, v, that maximizes the number of targets in the subtree rooted on v and the number of neighbors elsewhwere.

For example, the tree file o157.txt contains 449 E. coli genomes, with the pathogen O157:H7 as the target. We iterate over the 448 internal nodes, which were labeled with lande, and for each node calculate the node score consisting of the sum of the targets found in that node and the neighbors found outside:

for a in $(seq 448); do
t=$(pickle $a o157.txt | grep '^t' | wc -l)
n=$(pickle -c $a o157.txt | grep '^n' | wc -l)
((s=$t+$n))
echo $a $s
done > score.txt

We sort the scores to find that the desired node is 300, which is also the one I'd pick from visual inspection.

sort -n -k 2 -r score.txt | head

Hope this helps.

haubold commented 10 months ago

Just in case this is still of interest, the Neighbors package now also contains the program fintac for finding the target clade according to the procedure sketched above.