Reason for status=fail / Lineage=None

tseemann commented 4 years ago

Was wondering what the main reason for the following fail situation might be?

It's only 8 SNPs and 0 indels from WUHAN-1 so was a bit surprised.

taxon,lineage,SH-alrt,UFbootstrap,lineages_version,status,note
XXXX,None,0,0,2020-04-27,fail,N_content:0.78

A       T       C       G       N       K       M       R       W       Y
8229    8852    5048    5401    2341    4       4       4       2       18

I notice N_content:0.78 is not a proportion or a percentage - out by factor of 10 ? It should be 7.8% N

2341/29903 = .07828645955255325552

https://github.com/hCoV-2019/pangolin/blob/ae8cf33001a7df33048e74c027562d2d9fcbee6b/pangolin/command.py#L99-L105

tseemann commented 4 years ago

Should it be len(record) instead of len(record.seq) ?

https://biopython.org/DIST/docs/api/Bio.SeqRecord.SeqRecord-class.html

aineniamh commented 4 years ago

To check your issue I have created a sequence with the exact same composition as your one above (it's in pangolin/test/test_seqs.fasta) and it passes the qc step (lineage_report.csv output shown below). Running the block of code that calculates proportion of N manually on it it prints out an N content of 0.08.

I'm not sure what's going on that you got N_content:0.78, I haven't changed any of the code. I wonder if it's something to do with gaps? But that should only lower N content, not inflate it.

taxon	lineage	SH-alrt	UFbootstrap	lineages_version	status	note
issue_57_torsten_seq	B.1.12	100	91	release_2020-04-27	passed_qc
This_seq_has_6000_Ns_in_18000_bases	B.1	100	100	release_2020-04-27	passed_qc
Japan/DP0779/2020	B	100	100	release_2020-04-27	passed_qc
USA/NY-NYUMC7/2020	B.1	100	100	release_2020-04-27	passed_qc
This_seq_has_no_seq	None	0	0	release_2020-04-27	fail	seq_len:0
This_seq_is_too_short	None	0	0	release_2020-04-27	fail	seq_len:2997
This_seq_has_lots_of_Ns	None	0	0	release_2020-04-27	fail	N_content:0.98
This_seq_is_literally_just_N	None	0	0	release_2020-04-27	fail	N_content:1.0

Should it be len(record) instead of len(record.seq)

That shouldn't make a difference, yes it would be more straighforward and I'll change it, but len(record) == len(record.seq).

num_N = str(record.seq).upper().count("N") 
prop_N = round((num_N)/len(record.seq), 2)

and

num_N = str(record.seq).upper().count("N") 
prop_N = round((num_N)/len(record), 2)

give the same answer.

aineniamh commented 4 years ago

Can you send me your sequence for debugging purposes and I'll look into it further.

aineniamh commented 4 years ago

Is it just that particular sequence reporting the weird value or is it across all query sequences in that particular file?

aineniamh commented 4 years ago

Hey @tseemann, I think I'll close this issue now as we can't seem to replicate it. If you ever come across it again, happy to reopen.

tseemann commented 4 years ago

@aineniamh sorry yes i haven't encountered it again but i did upgrade to git master regularly. it was weird had us flummoxed for a bit. closing is good.

cov-lineages / pangolin

Reason for status=fail / Lineage=None #57