bebatut / PylProtPredictor

Prediction of PYL proteins
http://bebatut.fr/PylProtPredictor/
Apache License 2.0

Check prediction results #4

Open bebatut opened 7 years ago

bebatut commented 7 years ago

Hi,

We need to check the results of the prediction.

@keuv-grvl, @ylana Any idea how to do that?

Some ideas:

The check should run automatically, and so should the extraction of the genomes in data.

Bérénice

keuv-grvl commented 7 years ago

I'd bet an expert review will be mandatory since there is no authoritative pyrrolysine-containing protein database. So we need some positive and negative controls, which could be provided by experts.

One strategy for a positive control could be:

  1. Select some known Pyl proteins (FASTA)
  2. Align these proteins on whole genomes (using tblastn)
  3. Download the matching genomes or scaffolds (if available)
  4. Run the predictor
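Steps 2 and 3 above could be sketched roughly as follows. This is only an illustration, not part of PylProtPredictor: the function names, the file names and the 1e-5 e-value cutoff are assumptions, and the genome is passed to tblastn with `-subject` rather than as a pre-built BLAST database. The parser relies on the standard 12-column tabular output (`-outfmt 6`), where column 2 is the subject ID and column 11 the e-value.

```python
def tblastn_command(proteins_fasta, genome_fasta, evalue=1e-5):
    """Build a tblastn command line that writes tabular (outfmt 6) output.

    The e-value cutoff is an illustrative default, not a recommended value.
    """
    return [
        "tblastn",
        "-query", proteins_fasta,   # known Pyl proteins (FASTA)
        "-subject", genome_fasta,   # genome or scaffolds to search (FASTA)
        "-evalue", str(evalue),
        "-outfmt", "6",
    ]


def matching_subjects(tabular_text, max_evalue=1e-5):
    """Return the IDs of subject sequences with at least one good hit.

    outfmt 6 columns: qseqid sseqid pident length mismatch gapopen
    qstart qend sstart send evalue bitscore
    """
    subjects = set()
    for line in tabular_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.split("\t")
        if len(fields) < 12:
            continue  # not a standard 12-column hit line
        if float(fields[10]) <= max_evalue:
            subjects.add(fields[1])
    return subjects
```

The command list can be handed to `subprocess.run(..., capture_output=True, text=True)`, and the resulting `stdout` passed to `matching_subjects` to decide which genomes or scaffolds to download for step 3.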

For negative controls, expertly selected bacterial genomes should be sufficient.

bebatut commented 7 years ago

That could be a good idea. It would be even better if we could automate all the tasks, both to limit manual intervention and to let users test the tool themselves. Do you think we can do that?

keuv-grvl commented 7 years ago

I ran some validation tests (not really successful, though).

I built a tiny protein database containing 29 known Pyl proteins (FASTA). Here is the DB.

  1. I ran the complete pipeline multiple times using Mx1201 and my test DB. Each run took ~3.5 seconds.
  2. I calculated the sha512sum of results/test/conserved_potential_pyl_sequences.fasta after each run

The sha512sums differed between runs. This means we cannot check a calculated result against an expected one by comparing the files byte for byte. Instead, we have to compare the content of the files.
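A content-based comparison could look something like the sketch below. It assumes the runs differ only in record order or in how sequence lines are wrapped, which has not been verified here; if the predicted sequences themselves change between runs, this check will (rightly) still fail. The function names are made up for illustration, and records sharing the same header would overwrite each other.

```python
def read_fasta(text):
    """Parse FASTA text into a {header: sequence} dict.

    Sequence lines are joined, so line wrapping does not matter, and a
    dict comparison ignores the order of the records.
    """
    records = {}
    header = None
    chunks = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records[header] = "".join(chunks)
            header = line[1:]
            chunks = []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records[header] = "".join(chunks)
    return records


def same_content(expected_text, observed_text):
    """True if both FASTA texts hold the same headers and sequences."""
    return read_fasta(expected_text) == read_fasta(observed_text)
```

For the test above, the two arguments would be the contents of an expected file and of results/test/conserved_potential_pyl_sequences.fasta, each read with `open(path).read()`.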