Closed tseemann closed 4 years ago
Should it be len(record)
instead of len(record.seq)
?
https://biopython.org/DIST/docs/api/Bio.SeqRecord.SeqRecord-class.html
To check your issue I have created a sequence with the exact same composition as your one above (it's in pangolin/test/test_seqs.fasta
) and it passes the qc step (lineage_report.csv
output shown below). Running the block of code that calculates proportion of N manually on it it prints out an N content of 0.08.
I'm not sure what's going on that you got N_content:0.78
, I haven't changed any of the code. I wonder if it's something to do with gaps? But that should only lower N content, not inflate it.
taxon | lineage | SH-alrt | UFbootstrap | lineages_version | status | note |
---|---|---|---|---|---|---|
issue_57_torsten_seq | B.1.12 | 100 | 91 | release_2020-04-27 | passed_qc | |
This_seq_has_6000_Ns_in_18000_bases | B.1 | 100 | 100 | release_2020-04-27 | passed_qc | |
Japan/DP0779/2020 | B | 100 | 100 | release_2020-04-27 | passed_qc | |
USA/NY-NYUMC7/2020 | B.1 | 100 | 100 | release_2020-04-27 | passed_qc | |
This_seq_has_no_seq | None | 0 | 0 | release_2020-04-27 | fail | seq_len:0 |
This_seq_is_too_short | None | 0 | 0 | release_2020-04-27 | fail | seq_len:2997 |
This_seq_has_lots_of_Ns | None | 0 | 0 | release_2020-04-27 | fail | N_content:0.98 |
This_seq_is_literally_just_N | None | 0 | 0 | release_2020-04-27 | fail | N_content:1.0 |
Should it be len(record) instead of len(record.seq)
That shouldn't make a difference, yes it would be more straighforward and I'll change it, but len(record) == len(record.seq).
num_N = str(record.seq).upper().count("N")
prop_N = round((num_N)/len(record.seq), 2)
and
num_N = str(record.seq).upper().count("N")
prop_N = round((num_N)/len(record), 2)
give the same answer.
Can you send me your sequence for debugging purposes and I'll look into it further.
Is it just that particular sequence reporting the weird value or is it across all query sequences in that particular file?
Hey @tseemann, I think I'll close this issue now as we can't seem to replicate it. If you ever come across it again, happy to reopen.
@aineniamh sorry yes i haven't encountered it again but i did upgrade to git master regularly. it was weird had us flummoxed for a bit. closing is good.
Was wondering what the main reason for the following fail situation might be?
It's only 8 SNPs and 0 indels from WUHAN-1 so was a bit surprised.
I notice
N_content:0.78
is not a proportion or a percentage - out by factor of 10 ? It should be 7.8% Nhttps://github.com/hCoV-2019/pangolin/blob/ae8cf33001a7df33048e74c027562d2d9fcbee6b/pangolin/command.py#L99-L105