bioinfo-ut / PhenotypeSeeker

Identify phenotype-specific k-mers and predict phenotype using sequenced bacterial strains
GNU General Public License v3.0

inconsistency between chi2_results and k-mers_and_coefficients_in_log_reg_model #20

Closed rtnakano1984 closed 2 years ago

rtnakano1984 commented 2 years ago

Hello! Thanks for building such a nice pipeline! I have used it on my own bacterial genome dataset with binary trait data, and it worked pretty well. However, I found an inconsistency between chi2_results and k-mers_and_coefficients_in_log_reg_model, and I am having a hard time interpreting it.

In chi2_results, I have a number of k-mers that are found specifically in one group of genomes but not in the others. This makes perfect sense, and the list of strains that have these k-mers also matches the trait data nicely. This was very promising. However, when I looked at the k-mers_and_coefficients_in_log_reg_model file, I also found a number of k-mers, but it says that ALL k-mers are present in ALL genomes. The "No._of_samples_with_k-mer" column shows the same number for every k-mer, and this number is the total number of genomes that I fed into the pipeline. I compared the two files directly and found that the k-mers in the log_reg_model file are indeed those with low p-values in chi2_results, yet the numbers in the "num_samples_w_kmer" column of chi2_results.txt and the "No._of_samples_with_k-mer" column of k-mers_and_coefficients_in_log_reg_model.txt do not match, even for k-mers with identical sequences.
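For anyone who wants to run the same check, the comparison described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the two count column names come from the files mentioned in this issue, but the name of the k-mer sequence column (`K-mer` here), the tab-separated layout, and the toy rows are hypothetical and should be adapted to the actual output files.

```python
import csv
import io

# Count column names as quoted in this issue.
CHI2_COUNT_COL = "num_samples_w_kmer"
MODEL_COUNT_COL = "No._of_samples_with_k-mer"
# ASSUMPTION: hypothetical header of the k-mer sequence column; check your files.
KMER_COL = "K-mer"


def load_counts(handle, kmer_col, count_col):
    """Map each k-mer sequence to the sample count reported in one table."""
    # ASSUMPTION: the tables are tab-separated with a header row.
    reader = csv.DictReader(handle, delimiter="\t")
    return {row[kmer_col]: int(row[count_col]) for row in reader}


def find_mismatches(chi2_counts, model_counts):
    """Return (kmer, chi2_count, model_count) for shared k-mers whose counts differ."""
    shared = sorted(chi2_counts.keys() & model_counts.keys())
    return [
        (kmer, chi2_counts[kmer], model_counts[kmer])
        for kmer in shared
        if chi2_counts[kmer] != model_counts[kmer]
    ]


# Toy data standing in for the two result files (made up, not real output).
chi2_text = (
    "K-mer\tnum_samples_w_kmer\n"
    "ACGTACGTACGTA\t12\n"
    "TTGACCATGCAAG\t9\n"
)
model_text = (
    "K-mer\tNo._of_samples_with_k-mer\n"
    "ACGTACGTACGTA\t40\n"
    "TTGACCATGCAAG\t40\n"
)

chi2_counts = load_counts(io.StringIO(chi2_text), KMER_COL, CHI2_COUNT_COL)
model_counts = load_counts(io.StringIO(model_text), KMER_COL, MODEL_COUNT_COL)
mismatches = find_mismatches(chi2_counts, model_counts)

for kmer, n_chi2, n_model in mismatches:
    print(f"{kmer}\tchi2={n_chi2}\tmodel={n_model}")
```

With real files, the `io.StringIO(...)` objects would be replaced by handles from `open(path, newline="")`. An empty result means the two tables agree for every k-mer they share.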

Am I getting something utterly wrong or missing something? It would be very much appreciated if you could walk me through why this could happen!

Many many thanks, Thomas

erkiaun commented 2 years ago

Hi Ryohei!

I am going to look into it. It is probably some kind of bug, and the information in the chi2 results is the correct one. Could you please tell me which version you have (phenotypeseeker --version)?

Best regards, Erki

erkiaun commented 2 years ago

This should now be fixed in the latest version, 1.0.2.

Erki

rtnakano1984 commented 2 years ago

Hi Erki! Thanks a lot for your quick work. I can confirm that the log_reg_model results now show the right values on my side as well. Thanks!!