Closed olgabot closed 4 years ago
E.g. this line: https://github.com/czbiohub/kh-tools/blob/9a74448056a750d98f0d85a01113a3ac8d42c2c0/khtools/extract_coding.py#L320
is output for every iteration of the for loop, but the function should really be yielding the max_fraction_in_peptide_db, max_n_kmers, None for each frame of the sequence.
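To make the intent concrete, here is a minimal sketch of the per-frame behavior described above. It is not the code in extract_coding.py: `kmerize`, `score_each_frame`, the `translations` dict, and the plain Python set standing in for the peptide bloom filter are all hypothetical names chosen for illustration.

```python
from typing import Dict, Iterator, Optional, Set, Tuple


def kmerize(sequence: str, ksize: int) -> Set[str]:
    """Return the set of all k-mers of length ksize in the sequence."""
    return {sequence[i:i + ksize] for i in range(len(sequence) - ksize + 1)}


def score_each_frame(
    translations: Dict[int, str],  # hypothetical: {frame: translated peptide}
    known_kmers: Set[str],         # stand-in for the peptide bloom filter
    ksize: int,
) -> Iterator[Tuple[float, int, Optional[str]]]:
    """Yield (fraction_in_peptide_db, n_kmers, None) once per frame.

    Nothing is printed inside the loop and no maximum is taken here;
    the caller decides what to do with each frame's score.
    """
    for frame, peptide in translations.items():
        kmers = kmerize(peptide, ksize)
        if not kmers:
            # Peptide is shorter than ksize, so there is nothing to score
            continue
        n_kmers = len(kmers)
        fraction = sum(kmer in known_kmers for kmer in kmers) / n_kmers
        yield fraction, n_kmers, None


# Toy data, made up for this example
translations = {1: "ABCDE", 2: "XYZ"}
known_kmers = {"ABC", "BCD"}
scores = list(score_each_frame(translations, known_kmers, ksize=3))
```

With one tuple per frame, the caller gets as many rows as there are scorable frames, which is why the downstream tables grow compared with a max-only version.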
https://github.com/czbiohub/kh-tools/tree/pranathi-bug-ec - working on it in this branch
hey @olgabot here are the tests that are failing
=============================== FAILURES ===============================
_____ test_score_reads[protein_default_ksize] _____
capsys = <_pytest.capture.CaptureFixture object at 0x1a127dce90>
tmpdir = local('/private/var/folders/8m/_k_qns7d785cp8k976_m1bw40000gr/T/pytest-of-pranathivemuri/pytest-14/test_score_reads_protein_defau0')
reads = '/Users/pranathivemuri/czbiohub/kh-tools/tests/./data/SRR306838_GSM752691_hsa_br_F_1_trimmed_subsampled_n22.fq'
peptide_bloom_filter = <khmer._oxli.graphs.Nodegraph object at 0x1a124721d0>
molecule = 'protein'
true_scores = read_id jaccard_in_peptide_db n_kmers ... adversarial_low_complexity_peptide NaN 1.0 Low complexity peptide in protein encoding
true_scores_path = '/Users/pranathivemuri/czbiohub/kh-tools/tests/./data/extract_coding/SRR306838_GSM752691_hsa_br_F_1_trimmed_subsampled_n22__molecule-protein_ksize-7.csv'
true_protein_coding_fasta_path = '/Users/pranathivemuri/czbiohub/kh-tools/tests/./data/extract_coding/true_protein_coding.fasta'
def test_score_reads(capsys, tmpdir, reads, peptide_bloom_filter, molecule,
                     true_scores, true_scores_path,
                     true_protein_coding_fasta_path):
    from khtools.extract_coding import score_reads
    test = score_reads(reads,
                       peptide_bloom_filter,
                       molecule=molecule)
    # Check that scoring was the same
    pdt.assert_equal(test, true_scores)
tests/test_extract_coding.py:157:
pandas/_libs/testing.pyx:65: in pandas._libs.testing.assert_almost_equal
???
???
E AssertionError: DataFrame.iloc[:, 1] are different
E
E DataFrame.iloc[:, 1] values are different (17.3913 %)
E [left]: [1.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.058823529411764705, 0.0, 0.0625, 0.125, 0.058823529411764705, 0.0, 0.0625, 1.0, 0.0625, 0.0, 0.0, 0.0, 0.1111111111111111, 0.17647058823529413, nan, nan, 0.0]
E [right]: [1.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.05882352941176471, 0.0, 0.0625, 0.125, 0.05882352941176471, nan, 0.0625, 1.0, 0.0625, nan, nan, 0.0, 0.1111111111111111, 0.17647058823529413, nan, nan, nan]
pandas/_libs/testing.pyx:178: AssertionError
------------------------- Captured stdout call -------------------------
SRR306838.10559374 Ibis_Run100924_C3PO:6:51:17601:17119/1 translation_frame: -2 jaccard: 1.0
TEQDLQLYCDFPNIIDVSIKQA
SRR306838.2740879 Ibis_Run100924_C3PO:6:13:11155:5248/1 translation_frame: -1 jaccard: 1.0
QSSSPEFRVQSFSERTNARKKNNH
SRR306838.4880582 Ibis_Run100924_C3PO:6:23:17413:5436/1 translation_frame: 2 jaccard: 1.0
LDPPYSRVITQRETENNQMTSE
------------------------- Captured stderr call -------------------------
23it [00:00, 2023.43it/s]
_____ test_score_reads[dayhoff_default_ksize] _____
capsys = <_pytest.capture.CaptureFixture object at 0x1a127dfbd0>
tmpdir = local('/private/var/folders/8m/_k_qns7d785cp8k976_m1bw40000gr/T/pytest-of-pranathivemuri/pytest-14/test_score_reads_dayhoff_defau0')
reads = '/Users/pranathivemuri/czbiohub/kh-tools/tests/./data/SRR306838_GSM752691_hsa_br_F_1_trimmed_subsampled_n22.fq'
peptide_bloom_filter = <khmer._oxli.graphs.Nodegraph object at 0x1082f5150>
molecule = 'dayhoff'
true_scores = read_id jaccard_in_peptide_db n_kmers ... adversarial_low_complexity_peptide NaN 1.0 Low complexity peptide in dayhoff encoding
true_scores_path = '/Users/pranathivemuri/czbiohub/kh-tools/tests/./data/extract_coding/SRR306838_GSM752691_hsa_br_F_1_trimmed_subsampled_n22__molecule-dayhoff_ksize-12.csv'
true_protein_coding_fasta_path = '/Users/pranathivemuri/czbiohub/kh-tools/tests/./data/extract_coding/true_protein_coding.fasta'
def test_score_reads(capsys, tmpdir, reads, peptide_bloom_filter, molecule,
                     true_scores, true_scores_path,
                     true_protein_coding_fasta_path):
    from khtools.extract_coding import score_reads
    test = score_reads(reads,
                       peptide_bloom_filter,
                       molecule=molecule)
    # Check that scoring was the same
    pdt.assert_equal(test, true_scores)
tests/test_extract_coding.py:157:
pandas/_libs/testing.pyx:65: in pandas._libs.testing.assert_almost_equal
???
???
E AssertionError: DataFrame.iloc[:, 1] are different
E
E DataFrame.iloc[:, 1] values are different (21.73913 %)
E [left]: [1.0, 0.08333333333333333, 0.09090909090909091, 0.0, 0.0, 1.0, 0.0, 0.0, 0.09090909090909091, 0.0, 0.08333333333333333, 0.0, 0.09090909090909091, 1.0, 0.0, 0.0, 0.0, 0.0, 0.07692307692307693, 0.0, nan, nan, 0.0]
E [right]: [1.0, 0.08333333333333333, 0.09090909090909093, 0.0, 0.0, 1.0, 0.0, 0.0, 0.09090909090909093, 0.0, 0.08333333333333333, nan, 0.09090909090909093, 1.0, 0.0, nan, nan, nan, 0.07692307692307693, 0.0, nan, nan, nan]
pandas/_libs/testing.pyx:178: AssertionError
------------------------- Captured stdout call -------------------------
SRR306838.10559374 Ibis_Run100924_C3PO:6:51:17601:17119/1 translation_frame: -2 jaccard: 1.0
TEQDLQLYCDFPNIIDVSIKQA
SRR306838.2740879 Ibis_Run100924_C3PO:6:13:11155:5248/1 translation_frame: -1 jaccard: 1.0
QSSSPEFRVQSFSERTNARKKNNH
SRR306838.4880582 Ibis_Run100924_C3PO:6:23:17413:5436/1 translation_frame: 2 jaccard: 1.0
LDPPYSRVITQRETENNQMTSE
------------------------- Captured stderr call -------------------------
23it [00:00, 2216.25it/s]
_____ test_cli_csv[protein_default_ksize] _____
tmpdir = local('/private/var/folders/8m/_k_qns7d785cp8k976_m1bw40000gr/T/pytest-of-pranathivemuri/pytest-14/test_cli_csv_protein_default_k0')
reads = '/Users/pranathivemuri/czbiohub/kh-tools/tests/./data/SRR306838_GSM752691_hsa_br_F_1_trimmed_subsampled_n22.fq'
peptide_bloom_filter_path = '/Users/pranathivemuri/czbiohub/kh-tools/tests/./data/bloom_filter/Homo_sapiens.GRCh38.pep.subset.molecule-protein_ksize-7.bloomfilter.nodegraph'
molecule = 'protein'
peptide_ksize = 7
true_protein_coding_fasta_string = '>SRR306838.10559374 Ibis_Run100924_C3PO:6:51:17601:17119/1 translation_frame: -2 jaccard: 1.0\nTEQDLQLYCDFPNIIDVSIKQA...\n>SRR306838.4880582 Ibis_Run100924_C3PO:6:23:17413:5436/1 translation_frame: 2 jaccard: 1.0\nLDPPYSRVITQRETENNQMTSE\n'
true_scores = read_id jaccard_in_peptide_db n_kmers ... adversarial_low_complexity_peptide NaN 1.0 Low complexity peptide in protein encoding
def test_cli_csv(tmpdir, reads, peptide_bloom_filter_path, molecule,
                 peptide_ksize, true_protein_coding_fasta_string, true_scores):
    from khtools.extract_coding import cli
    csv = os.path.join(tmpdir, 'coding_scores.csv')
    runner = CliRunner()
    result = runner.invoke(cli, [
        '--peptide-ksize', peptide_ksize, "--csv", csv,
        "--peptides-are-bloom-filter", '--molecule', molecule,
        peptide_bloom_filter_path, reads
    ])
    assert result.exit_code == 0
    assert true_protein_coding_fasta_string in result.output
    assert os.path.exists(csv)
    # the CLI adds the filename to the scoring dataframe
    true = true_scores.copy()
    true['filename'] = reads
    test_scores = pd.read_csv(csv)
    pdt.assert_equal(test_scores, true)
tests/test_extract_coding.py:273:
pandas/_libs/testing.pyx:65: in pandas._libs.testing.assert_almost_equal
???
???
E AssertionError: DataFrame.iloc[:, 1] are different
E
E DataFrame.iloc[:, 1] values are different (17.3913 %)
E [left]: [1.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.05882352941176471, 0.0, 0.0625, 0.125, 0.05882352941176471, 0.0, 0.0625, 1.0, 0.0625, 0.0, 0.0, 0.0, 0.1111111111111111, 0.17647058823529413, nan, nan, 0.0]
E [right]: [1.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.05882352941176471, 0.0, 0.0625, 0.125, 0.05882352941176471, nan, 0.0625, 1.0, 0.0625, nan, nan, 0.0, 0.1111111111111111, 0.17647058823529413, nan, nan, nan]
pandas/_libs/testing.pyx:178: AssertionError
_____ test_cli_csv[dayhoff_default_ksize] _____
tmpdir = local('/private/var/folders/8m/_k_qns7d785cp8k976_m1bw40000gr/T/pytest-of-pranathivemuri/pytest-14/test_cli_csv_dayhoff_default_k0')
reads = '/Users/pranathivemuri/czbiohub/kh-tools/tests/./data/SRR306838_GSM752691_hsa_br_F_1_trimmed_subsampled_n22.fq'
peptide_bloom_filter_path = '/Users/pranathivemuri/czbiohub/kh-tools/tests/./data/bloom_filter/Homo_sapiens.GRCh38.pep.subset.molecule-dayhoff_ksize-12.bloomfilter.nodegraph'
molecule = 'dayhoff'
peptide_ksize = 12
true_protein_coding_fasta_string = '>SRR306838.10559374 Ibis_Run100924_C3PO:6:51:17601:17119/1 translation_frame: -2 jaccard: 1.0\nTEQDLQLYCDFPNIIDVSIKQA...\n>SRR306838.4880582 Ibis_Run100924_C3PO:6:23:17413:5436/1 translation_frame: 2 jaccard: 1.0\nLDPPYSRVITQRETENNQMTSE\n'
true_scores = read_id jaccard_in_peptide_db n_kmers ... adversarial_low_complexity_peptide NaN 1.0 Low complexity peptide in dayhoff encoding
def test_cli_csv(tmpdir, reads, peptide_bloom_filter_path, molecule,
                 peptide_ksize, true_protein_coding_fasta_string, true_scores):
    from khtools.extract_coding import cli
    csv = os.path.join(tmpdir, 'coding_scores.csv')
    runner = CliRunner()
    result = runner.invoke(cli, [
        '--peptide-ksize', peptide_ksize, "--csv", csv,
        "--peptides-are-bloom-filter", '--molecule', molecule,
        peptide_bloom_filter_path, reads
    ])
    assert result.exit_code == 0
    assert true_protein_coding_fasta_string in result.output
    assert os.path.exists(csv)
    # the CLI adds the filename to the scoring dataframe
    true = true_scores.copy()
    true['filename'] = reads
    test_scores = pd.read_csv(csv)
    pdt.assert_equal(test_scores, true)
tests/test_extract_coding.py:273:
pandas/_libs/testing.pyx:65: in pandas._libs.testing.assert_almost_equal
???
???
E AssertionError: DataFrame.iloc[:, 1] are different
E
E DataFrame.iloc[:, 1] values are different (21.73913 %)
E [left]: [1.0, 0.08333333333333333, 0.09090909090909093, 0.0, 0.0, 1.0, 0.0, 0.0, 0.09090909090909093, 0.0, 0.08333333333333333, 0.0, 0.09090909090909093, 1.0, 0.0, 0.0, 0.0, 0.0, 0.07692307692307693, 0.0, nan, nan, 0.0]
E [right]: [1.0, 0.08333333333333333, 0.09090909090909093, 0.0, 0.0, 1.0, 0.0, 0.0, 0.09090909090909093, 0.0, 0.08333333333333333, nan, 0.09090909090909093, 1.0, 0.0, nan, nan, nan, 0.07692307692307693, 0.0, nan, nan, nan]
pandas/_libs/testing.pyx:178: AssertionError
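For anyone reading the failures above, a small standard-library sketch may help locate where the two columns disagree. It copies the [left] and [right] vectors from the test_score_reads[protein_default_ksize] failure; the helper name `same` and the tolerance are assumptions for illustration, not what pandas does internally.

```python
import math

nan = float("nan")

# Vectors copied from the protein_default_ksize failure message
left = [1.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.058823529411764705, 0.0, 0.0625,
        0.125, 0.058823529411764705, 0.0, 0.0625, 1.0, 0.0625, 0.0, 0.0,
        0.0, 0.1111111111111111, 0.17647058823529413, nan, nan, 0.0]
right = [1.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.05882352941176471, 0.0, 0.0625,
         0.125, 0.05882352941176471, nan, 0.0625, 1.0, 0.0625, nan, nan,
         0.0, 0.1111111111111111, 0.17647058823529413, nan, nan, nan]


def same(a: float, b: float) -> bool:
    """Treat NaN as equal to NaN, and compare other floats approximately."""
    if math.isnan(a) or math.isnan(b):
        return math.isnan(a) and math.isnan(b)
    return math.isclose(a, b, rel_tol=1e-9)


# Collect (row index, left value, right value) for every disagreement
mismatches = [
    (i, a, b) for i, (a, b) in enumerate(zip(left, right)) if not same(a, b)
]
percent_different = 100 * len(mismatches) / len(left)

for i, a, b in mismatches:
    print(f"row {i}: left={a} right={b}")
print(f"{percent_different:.4f} % of values differ")
```

On these vectors the disagreements are rows 11, 15, 16 and 22, which is 4 of 23 values and matches the reported 17.3913 %. In each of those rows the left side is 0.0 and the right side is NaN. The dayhoff vectors show the same pattern in 5 of 23 rows (21.73913 %).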
So, right now maybe_score_single_read calls score_single_read, which yields a list. Do you want that? https://github.com/czbiohub/kh-tools/blob/pranathi-bug-ec/khtools/extract_coding.py#L374
Is that okay? If so, I will change line 374 above.
I have already changed this line: https://github.com/czbiohub/kh-tools/blob/pranathi-bug-ec/khtools/extract_coding.py#L415
RE failing tests -- feel free to replace the true_scores_path fixture paths with the new ones. There will necessarily be different coding score values, as this is yielding far more rows than the previous version.
And thank you!!
Hi Pranathi, Can you make a draft PR of your branch so it's easier to comment line by line there? Thanks! Olga
Addressed here: https://github.com/czbiohub/kh-tools/pull/35
@bluegenes -- from your simulated datasets, did you have some DNA sequences that are known to have multiple reading frames? That would be very helpful for testing here!
https://github.com/czbiohub/kh-tools/pull/56#pullrequestreview-389938441 - this bug is actually being addressed in this PR
E.g. for this output, there's a total of 443,833 predicted proteins, but only 325,437 read IDs.
Turns out some of these are because the R1 and R2 weren't treated differently, e.g. here there is one reading frame for
A00111:133:H3VGJDSXX:3:2153:18738:34100 1:N:0:TTTGACAGGCTG+TCATTACATGAT
but 6 (!!?!?) for A00111:133:H3VGJDSXX:3:2153:18738:34100 2:N:0:TTTGACAGGCTG+TCATTACATGAT
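One way to see the R1/R2 effect is to count frames per read name with and without the mate number. The sketch below is illustrative only: `read_key` is a hypothetical helper, and it assumes the headers follow the Illumina layout shown above, where the text after the first space starts with the mate number (1 or 2). The header list is reduced to the bare read headers, one per predicted frame.

```python
from collections import Counter
from typing import Tuple


def read_key(header: str) -> Tuple[str, str]:
    """Split a header into (read name, mate number).

    Assumes an Illumina-style header such as
    'NAME 1:N:0:BARCODE', where the mate number leads the comment.
    """
    name, _, comment = header.partition(" ")
    mate = comment.split(":", 1)[0] if comment else ""
    return name, mate


name = "A00111:133:H3VGJDSXX:3:2153:18738:34100"
barcode = "TTTGACAGGCTG+TCATTACATGAT"

# One predicted frame for R1 and six for R2, as in the example above
headers = [f"{name} 1:N:0:{barcode}"] + [f"{name} 2:N:0:{barcode}"] * 6

# Counting on the name alone merges both mates into one read ID
frames_per_name = Counter(header.split(" ")[0] for header in headers)

# Counting on (name, mate) keeps R1 and R2 apart
frames_per_mate = Counter(read_key(header) for header in headers)

print(frames_per_name[name])           # 7
print(frames_per_mate[(name, "1")])    # 1
print(frames_per_mate[(name, "2")])    # 6
```

Keying on the name alone makes this read look like it has seven frames, while keying on the name plus mate number separates the single R1 frame from the six R2 frames.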
This makes the nucleotide sequence confusing as there are multiple jaccards: