BUG: coding_scores.csv for extract_coding isn't accurate for the number of frames per sample

olgabot commented 4 years ago

E.g. for this output, there's a total of 443,833 predicted proteins, but only 325,437 read IDs.

(diamond)
 Fri 21 Feb - 13:46  ~/data_sm/tabula-microcebus/analyses/kmermaid/blood_cross_species/protein_ksize7 
 olga@x86_64-conda_cos6-linux-gnu  bioawk -c fastx ' { print $name } ' extract_coding/J7_B000578_B009057_S223__coding_reads_peptides.fasta | wc -l
443833
(diamond)
 Fri 21 Feb - 13:46  ~/data_sm/tabula-microcebus/analyses/kmermaid/blood_cross_species/protein_ksize7 
 olga@x86_64-conda_cos6-linux-gnu  bioawk -c fastx ' { print $name } ' extract_coding/J7_B000578_B009057_S223__coding_reads_peptides.fasta | sort | uniq | wc -l
325437

Turns out some of these are because the R1 and R2 weren't treated differently, e.g. here there is one reading frame for A00111:133:H3VGJDSXX:3:2153:18738:34100 1:N:0:TTTGACAGGCTG+TCATTACATGAT but 6 (!!?!?) for A00111:133:H3VGJDSXX:3:2153:18738:34100 2:N:0:TTTGACAGGCTG+TCATTACATGAT:

$ fgrep -A 1 'A00111:133:H3VGJDSXX:3:2153:18738:34100' $peptide_fasta
>A00111:133:H3VGJDSXX:3:2153:18738:34100 1:N:0:TTTGACAGGCTG+TCATTACATGAT translation_frame: -3 jaccard: 1.0
ENLKAAQEEYVKRALANSLACQGKYTPSGQAG
--
>A00111:133:H3VGJDSXX:3:2153:18738:34100 2:N:0:TTTGACAGGCTG+TCATTACATGAT translation_frame: 1 jaccard: 0.6666666666666666
GCGGGARRWRGGGGRRRGGRGGGGGGRRGGGGR
>A00111:133:H3VGJDSXX:3:2153:18738:34100 2:N:0:TTTGACAGGCTG+TCATTACATGAT translation_frame: 2 jaccard: 1.0
GAGAARGGGAGGAGAGGGGGGGGGGGAGGGGAG
>A00111:133:H3VGJDSXX:3:2153:18738:34100 2:N:0:TTTGACAGGCTG+TCATTACATGAT translation_frame: 3 jaccard: 0.6153846153846154
VRGRRAAVARGGRAPAGGAGGGGGGAPGGGGP
>A00111:133:H3VGJDSXX:3:2153:18738:34100 2:N:0:TTTGACAGGCTG+TCATTACATGAT translation_frame: -1 jaccard: 0.7037037037037037
PGPPPPGAPPPPPPAPPAGARPPRATAARRPRT
>A00111:133:H3VGJDSXX:3:2153:18738:34100 2:N:0:TTTGACAGGCTG+TCATTACATGAT translation_frame: -2 jaccard: 0.9565217391304348
PAPPPPAPPPPPPPPPPPAPAPPAPPPRAAPAP
>A00111:133:H3VGJDSXX:3:2153:18738:34100 2:N:0:TTTGACAGGCTG+TCATTACATGAT translation_frame: -3 jaccard: 0.6923076923076923
RPPPPRRPPPPPPRPPRRRPPPPRHRRAPPPH

This makes the nucleotide sequence confusing as there's multiple jaccards:

>A00111:133:H3VGJDSXX:3:2153:18738:34100 2:N:0:TTTGACAGGCTG+TCATTACATGAT translation_frame: 1 jaccard: 0.6666666666666666
GGGTGCGGGGGCGGCGCGCGGCGGTGGCGCGGGGGGGGCGGGCGCCGGCGGGGGGGGCGGGGGGGGGGGGGGGGGGGGCGCCGGGGGGGGGGGGGCCGGG
>A00111:133:H3VGJDSXX:3:2153:18738:34100 2:N:0:TTTGACAGGCTG+TCATTACATGAT translation_frame: 2 jaccard: 1.0
GGGTGCGGGGGCGGCGCGCGGCGGTGGCGCGGGGGGGGCGGGCGCCGGCGGGGGGGGCGGGGGGGGGGGGGGGGGGGGCGCCGGGGGGGGGGGGGCCGGG
>A00111:133:H3VGJDSXX:3:2153:18738:34100 2:N:0:TTTGACAGGCTG+TCATTACATGAT translation_frame: 3 jaccard: 0.6153846153846154
GGGTGCGGGGGCGGCGCGCGGCGGTGGCGCGGGGGGGGCGGGCGCCGGCGGGGGGGGCGGGGGGGGGGGGGGGGGGGGCGCCGGGGGGGGGGGGGCCGGG
>A00111:133:H3VGJDSXX:3:2153:18738:34100 2:N:0:TTTGACAGGCTG+TCATTACATGAT translation_frame: -1 jaccard: 0.7037037037037037
GGGTGCGGGGGCGGCGCGCGGCGGTGGCGCGGGGGGGGCGGGCGCCGGCGGGGGGGGCGGGGGGGGGGGGGGGGGGGGCGCCGGGGGGGGGGGGGCCGGG
>A00111:133:H3VGJDSXX:3:2153:18738:34100 2:N:0:TTTGACAGGCTG+TCATTACATGAT translation_frame: -2 jaccard: 0.9565217391304348
GGGTGCGGGGGCGGCGCGCGGCGGTGGCGCGGGGGGGGCGGGCGCCGGCGGGGGGGGCGGGGGGGGGGGGGGGGGGGGCGCCGGGGGGGGGGGGGCCGGG
>A00111:133:H3VGJDSXX:3:2153:18738:34100 2:N:0:TTTGACAGGCTG+TCATTACATGAT translation_frame: -3 jaccard: 0.6923076923076923
GGGTGCGGGGGCGGCGCGCGGCGGTGGCGCGGGGGGGGCGGGCGCCGGCGGGGGGGGCGGGGGGGGGGGGGGGGGGGGCGCCGGGGGGGGGGGGGCCGGG

olgabot commented 4 years ago

e.g. this line: https://github.com/czbiohub/kh-tools/blob/9a74448056a750d98f0d85a01113a3ac8d42c2c0/khtools/extract_coding.py#L320 is output for every iteration of the for loop, but the function should really be yield-ing the max_fraction_in_peptide_db, max_n_kmers, None for each frame of the sequence

pranathivemuri commented 4 years ago

https://github.com/czbiohub/kh-tools/tree/pranathi-bug-ec - working on it in this branch

pranathivemuri commented 4 years ago

hey @olgabot here are the tests that are failing

================================================================================================= FAILURES ================================================================================================= _____ test_score_reads[protein_default_ksize] __

capsys = <_pytest.capture.CaptureFixture object at 0x1a127dce90> tmpdir = local('/private/var/folders/8m/_k_qns7d785cp8k976_m1bw40000gr/T/pytest-of-pranathivemuri/pytest-14/test_score_reads_protein_defau0') reads = '/Users/pranathivemuri/czbiohub/kh-tools/tests/./data/SRR306838_GSM752691_hsa_br_F_1_trimmed_subsampled_n22.fq', peptide_bloom_filter = <khmer._oxli.graphs.Nodegraph object at 0x1a124721d0> molecule = 'protein' true_scores = read_id jaccard_in_peptide_db n_kmers ... adversarial_low_complexity_peptide NaN 1.0 Low complexity peptide in protein encoding true_scores_path = '/Users/pranathivemuri/czbiohub/kh-tools/tests/./data/extract_coding/SRR306838_GSM752691_hsa_br_F_1_trimmed_subsampled_n22__molecule-protein_ksize-7.csv' true_protein_coding_fasta_path = '/Users/pranathivemuri/czbiohub/kh-tools/tests/./data/extract_coding/true_protein_coding.fasta'

def test_score_reads(capsys, tmpdir, reads, peptide_bloom_filter, molecule,
                     true_scores, true_scores_path,
                     true_protein_coding_fasta_path):
    from khtools.extract_coding import score_reads

    test = score_reads(reads,
                       peptide_bloom_filter,
                       molecule=molecule)

    # Check that scoring was the same

  pdt.assert_equal(test, true_scores)

tests/test_extract_coding.py:157:

pandas/_libs/testing.pyx:65: in pandas._libs.testing.assert_almost_equal ???

??? E AssertionError: DataFrame.iloc[:, 1] are different E
E DataFrame.iloc[:, 1] values are different (17.3913 %) E [left]: [1.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.058823529411764705, 0.0, 0.0625, 0.125, 0.058823529411764705, 0.0, 0.0625, 1.0, 0.0625, 0.0, 0.0, 0.0, 0.1111111111111111, 0.17647058823529413, nan, nan, 0.0] E [right]: [1.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.05882352941176471, 0.0, 0.0625, 0.125, 0.05882352941176471, nan, 0.0625, 1.0, 0.0625, nan, nan, 0.0, 0.1111111111111111, 0.17647058823529413, nan, nan, nan]

pandas/_libs/testing.pyx:178: AssertionError ------------------------------------------------------------------------------------------- Captured stdout call -------------------------------------------------------------------------------------------

SRR306838.10559374 Ibis_Run100924_C3PO:6:51:17601:17119/1 translation_frame: -2 jaccard: 1.0 TEQDLQLYCDFPNIIDVSIKQA SRR306838.2740879 Ibis_Run100924_C3PO:6:13:11155:5248/1 translation_frame: -1 jaccard: 1.0 QSSSPEFRVQSFSERTNARKKNNH SRR306838.4880582 Ibis_Run100924_C3PO:6:23:17413:5436/1 translationframe: 2 jaccard: 1.0 LDPPYSRVITQRETENNQMTSE ------------------------------------------------------------------------------------------- Captured stderr call ------------------------------------------------------------------------------------------- 23it [00:00, 2023.43it/s] ____ test_score_reads[dayhoff_default_ksize] __

capsys = <_pytest.capture.CaptureFixture object at 0x1a127dfbd0> tmpdir = local('/private/var/folders/8m/_k_qns7d785cp8k976_m1bw40000gr/T/pytest-of-pranathivemuri/pytest-14/test_score_reads_dayhoff_defau0') reads = '/Users/pranathivemuri/czbiohub/kh-tools/tests/./data/SRR306838_GSM752691_hsa_br_F_1_trimmed_subsampled_n22.fq', peptide_bloom_filter = <khmer._oxli.graphs.Nodegraph object at 0x1082f5150> molecule = 'dayhoff' true_scores = read_id jaccard_in_peptide_db n_kmers ... adversarial_low_complexity_peptide NaN 1.0 Low complexity peptide in dayhoff encoding true_scores_path = '/Users/pranathivemuri/czbiohub/kh-tools/tests/./data/extract_coding/SRR306838_GSM752691_hsa_br_F_1_trimmed_subsampled_n22__molecule-dayhoff_ksize-12.csv' true_protein_coding_fasta_path = '/Users/pranathivemuri/czbiohub/kh-tools/tests/./data/extract_coding/true_protein_coding.fasta'

def test_score_reads(capsys, tmpdir, reads, peptide_bloom_filter, molecule,
                     true_scores, true_scores_path,
                     true_protein_coding_fasta_path):
    from khtools.extract_coding import score_reads

    test = score_reads(reads,
                       peptide_bloom_filter,
                       molecule=molecule)

    # Check that scoring was the same

  pdt.assert_equal(test, true_scores)

tests/test_extract_coding.py:157:

pandas/_libs/testing.pyx:65: in pandas._libs.testing.assert_almost_equal ???

??? E AssertionError: DataFrame.iloc[:, 1] are different E
E DataFrame.iloc[:, 1] values are different (21.73913 %) E [left]: [1.0, 0.08333333333333333, 0.09090909090909091, 0.0, 0.0, 1.0, 0.0, 0.0, 0.09090909090909091, 0.0, 0.08333333333333333, 0.0, 0.09090909090909091, 1.0, 0.0, 0.0, 0.0, 0.0, 0.07692307692307693, 0.0, nan, nan, 0.0] E [right]: [1.0, 0.08333333333333333, 0.09090909090909093, 0.0, 0.0, 1.0, 0.0, 0.0, 0.09090909090909093, 0.0, 0.08333333333333333, nan, 0.09090909090909093, 1.0, 0.0, nan, nan, nan, 0.07692307692307693, 0.0, nan, nan, nan]

pandas/_libs/testing.pyx:178: AssertionError ------------------------------------------------------------------------------------------- Captured stdout call -------------------------------------------------------------------------------------------

SRR306838.10559374 Ibis_Run100924_C3PO:6:51:17601:17119/1 translation_frame: -2 jaccard: 1.0 TEQDLQLYCDFPNIIDVSIKQA SRR306838.2740879 Ibis_Run100924_C3PO:6:13:11155:5248/1 translation_frame: -1 jaccard: 1.0 QSSSPEFRVQSFSERTNARKKNNH SRR306838.4880582 Ibis_Run100924_C3PO:6:23:17413:5436/1 translation_frame: 2 jaccard: 1.0 LDPPYSRVITQRETENNQMTSE ------------------------------------------------------------------------------------------- Captured stderr call ------------------------------------------------------------------------------------------- 23it [00:00, 2216.25it/s] ___ test_cli_csv[protein_default_ksize] ____

tmpdir = local('/private/var/folders/8m/_k_qns7d785cp8k976_m1bw40000gr/T/pytest-of-pranathivemuri/pytest-14/test_cli_csv_protein_default_k0') reads = '/Users/pranathivemuri/czbiohub/kh-tools/tests/./data/SRR306838_GSM752691_hsa_br_F_1_trimmed_subsampled_n22.fq' peptide_bloom_filter_path = '/Users/pranathivemuri/czbiohub/kh-tools/tests/./data/bloom_filter/Homo_sapiens.GRCh38.pep.subset.molecule-protein_ksize-7.bloomfilter.nodegraph', molecule = 'protein' peptide_ksize = 7 true_protein_coding_fasta_string = '>SRR306838.10559374 Ibis_Run100924_C3PO:6:51:17601:17119/1 translation_frame: -2 jaccard: 1.0\nTEQDLQLYCDFPNIIDVSIKQA...\n>SRR306838.4880582 Ibis_Run100924_C3PO:6:23:17413:5436/1 translation_frame: 2 jaccard: 1.0\nLDPPYSRVITQRETENNQMTSE\n' true_scores = read_id jaccard_in_peptide_db n_kmers ... adversarial_low_complexity_peptide NaN 1.0 Low complexity peptide in protein encoding

def test_cli_csv(tmpdir, reads, peptide_bloom_filter_path, molecule,
                 peptide_ksize, true_protein_coding_fasta_string, true_scores):
    from khtools.extract_coding import cli

    csv = os.path.join(tmpdir, 'coding_scores.csv')

    runner = CliRunner()
    result = runner.invoke(cli, [
        '--peptide-ksize', peptide_ksize, "--csv", csv,
        "--peptides-are-bloom-filter", '--molecule', molecule,
        peptide_bloom_filter_path, reads
    ])
    assert result.exit_code == 0
    assert true_protein_coding_fasta_string in result.output
    assert os.path.exists(csv)

    # the CLI adds the filename to the scoring dataframe
    true = true_scores.copy()
    true['filename'] = reads

    test_scores = pd.read_csv(csv)

  pdt.assert_equal(test_scores, true)

tests/test_extract_coding.py:273:

pandas/_libs/testing.pyx:65: in pandas._libs.testing.assert_almost_equal ???

??? E AssertionError: DataFrame.iloc[:, 1] are different E
E DataFrame.iloc[:, 1] values are different (17.3913 %) E [left]: [1.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.05882352941176471, 0.0, 0.0625, 0.125, 0.05882352941176471, 0.0, 0.0625, 1.0, 0.0625, 0.0, 0.0, 0.0, 0.1111111111111111, 0.17647058823529413, nan, nan, 0.0] E [right]: [1.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.05882352941176471, 0.0, 0.0625, 0.125, 0.05882352941176471, nan, 0.0625, 1.0, 0.0625, nan, nan, 0.0, 0.1111111111111111, 0.17647058823529413, nan, nan, nan]

pandas/_libs/testing.pyx:178: AssertionError ___ test_cli_csv[dayhoff_default_ksize] ____

tmpdir = local('/private/var/folders/8m/_k_qns7d785cp8k976_m1bw40000gr/T/pytest-of-pranathivemuri/pytest-14/test_cli_csv_dayhoff_default_k0') reads = '/Users/pranathivemuri/czbiohub/kh-tools/tests/./data/SRR306838_GSM752691_hsa_br_F_1_trimmed_subsampled_n22.fq' peptide_bloom_filter_path = '/Users/pranathivemuri/czbiohub/kh-tools/tests/./data/bloom_filter/Homo_sapiens.GRCh38.pep.subset.molecule-dayhoff_ksize-12.bloomfilter.nodegraph', molecule = 'dayhoff' peptide_ksize = 12 true_protein_coding_fasta_string = '>SRR306838.10559374 Ibis_Run100924_C3PO:6:51:17601:17119/1 translation_frame: -2 jaccard: 1.0\nTEQDLQLYCDFPNIIDVSIKQA...\n>SRR306838.4880582 Ibis_Run100924_C3PO:6:23:17413:5436/1 translation_frame: 2 jaccard: 1.0\nLDPPYSRVITQRETENNQMTSE\n' true_scores = read_id jaccard_in_peptide_db n_kmers ... adversarial_low_complexity_peptide NaN 1.0 Low complexity peptide in dayhoff encoding

def test_cli_csv(tmpdir, reads, peptide_bloom_filter_path, molecule,
                 peptide_ksize, true_protein_coding_fasta_string, true_scores):
    from khtools.extract_coding import cli

    csv = os.path.join(tmpdir, 'coding_scores.csv')

    runner = CliRunner()
    result = runner.invoke(cli, [
        '--peptide-ksize', peptide_ksize, "--csv", csv,
        "--peptides-are-bloom-filter", '--molecule', molecule,
        peptide_bloom_filter_path, reads
    ])
    assert result.exit_code == 0
    assert true_protein_coding_fasta_string in result.output
    assert os.path.exists(csv)

    # the CLI adds the filename to the scoring dataframe
    true = true_scores.copy()
    true['filename'] = reads

    test_scores = pd.read_csv(csv)

  pdt.assert_equal(test_scores, true)

tests/test_extract_coding.py:273:

pandas/_libs/testing.pyx:65: in pandas._libs.testing.assert_almost_equal ???

??? E AssertionError: DataFrame.iloc[:, 1] are different E
E DataFrame.iloc[:, 1] values are different (21.73913 %) E [left]: [1.0, 0.08333333333333333, 0.09090909090909093, 0.0, 0.0, 1.0, 0.0, 0.0, 0.09090909090909093, 0.0, 0.08333333333333333, 0.0, 0.09090909090909093, 1.0, 0.0, 0.0, 0.0, 0.0, 0.07692307692307693, 0.0, nan, nan, 0.0] E [right]: [1.0, 0.08333333333333333, 0.09090909090909093, 0.0, 0.0, 1.0, 0.0, 0.0, 0.09090909090909093, 0.0, 0.08333333333333333, nan, 0.09090909090909093, 1.0, 0.0, nan, nan, nan, 0.07692307692307693, 0.0, nan, nan, nan]

pandas/_libs/testing.pyx:178: AssertionError

pranathivemuri commented 4 years ago

so, right now maybe_score_single_Read calls score_single_read that yields a list do you want that? https://github.com/czbiohub/kh-tools/blob/pranathi-bug-ec/khtools/extract_coding.py#L374

is that okay, if so I will change line 374 above

I have already changed this line - https://github.com/czbiohub/kh-tools/blob/pranathi-bug-ec/khtools/extract_coding.py#L415

olgabot commented 4 years ago

RE failing tests -- feel free to replace the true_scores_path fixture paths with the new ones. There will necessarily be different coding score values as this is yielding far more rows than the previous version.

olgabot commented 4 years ago

And thank you!!

olgabot commented 4 years ago

Hi Pranathi, Can you make a draft PR of your branch so it's easier to comment line by line there? Thanks! Olga

On Wed, Feb 26, 2020, 17:32 Pranathi Vemuri notifications@github.com wrote:

so, right now maybe_score_single_Read calls score_single_read that yields a list do you want that?

https://github.com/czbiohub/kh-tools/blob/pranathi-bug-ec/khtools/extract_coding.py#L374

is that okay, if so I will change line 374 above

I have already changed this line - https://github.com/czbiohub/kh-tools/blob/pranathi-bug-ec/khtools/extract_coding.py#L415

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/czbiohub/kh-tools/issues/30?email_source=notifications&email_token=AAGE24ELCOTLEOIBVHLQDCLRE4JZ5A5CNFSM4KZLI4S2YY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOENCRVDI#issuecomment-591731341, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAGE24CQAC3PEL2OBRXCBDLRE4JZ5ANCNFSM4KZLI4SQ .

olgabot commented 4 years ago

Addressed here: https://github.com/czbiohub/kh-tools/pull/35

olgabot commented 4 years ago

@bluegenes -- from your simulated datasets, did you have some DNA sequences that are known to have multiple reading frames? That would be very helpful for testing here!

pranathivemuri commented 4 years ago

https://github.com/czbiohub/kh-tools/pull/56#pullrequestreview-389938441 - this bug is actually being addresses in this PR

pranathivemuri commented 4 years ago

fixed in https://github.com/czbiohub/kh-tools/pull/57

czbiohub-sf / orpheum

BUG: coding_scores.csv for extract_coding isn't accurate for the number of frames per sample #30