MaWeffm / ReCo

ReCo: automated NGS read-counting of single and combinatorial CRISPR gRNAs.
MIT License

Invalid sequences found (min() arg is an empty sequence) #2

Closed cpfeifer58 closed 1 year ago

cpfeifer58 commented 1 year ago

Hello!

When ReCo loads a library for a paired sample type, I get an error in which all of the sequences in my library are reported as invalid. The full output is below:

2023-08-29 14:05:07 WARNING: Duplicate(s) found in column 'guide': [362, 365, 1006, 1008, 1033, 1034, 1477, 1931, 1934, 1935, 1936, 2128, 2129, 2130] (14). Will keep the first occurrence!
2023-08-29 14:05:07 WARNING: Invalid sequences found: 2810.
2023-08-29 14:05:07 WARNING:                      guide sequence dna_seq
0     ACAGTGTAAAACCCTTAGAG  ABCC5_1    text
1     AGCACCAAGCAAGCTGCAGG  ABCC5_2    text
2     CTCTGCTCGAGGGCCTTTTG  ABCC5_3    text
3     CTCGTTACACATCTCCTCGG  ABCC5_4    text
4     CCCCGAGGAGATGTGTAACG  ABCC5_5    text
...                    ...      ...     ...
2812  TCTTTGAAATGAGAAAGAAA  ZZEF1_5    text
2813  CAAAGATTCTCAATATATTA  ZZEF1_6    text
2814  CAGGCATCGATTACATTGTG  ZZEF1_7    text
2815  CACACAATGTAATCGATGCC  ZZEF1_8    text
2816  GATGCCTGGAGTGAGGTGCA  ZZEF1_9    text

[2810 rows x 3 columns]
Traceback (most recent call last):
  File "/reco/ReCo.py", line 111, in <module>
    main()
  File "/reco/ReCo.py", line 102, in main
    r = ReCo(
  File "/reco/reco/reco.py", line 47, in __init__
    self.sample_sheet = SampleSheet.from_file(
  File "/reco/reco/sample_sheet.py", line 51, in from_file
    s_sheet.read_sample_sheet_file()
  File "/reco/reco/sample_sheet.py", line 85, in read_sample_sheet_file
    self.create_samples()
  File "/reco/reco/sample_sheet.py", line 108, in create_samples
    self.samples[sample_counter] = PairedSample(
  File "/reco/reco/sample.py", line 476, in __init__
    self.lib_1 = Library(logger=self.logger, library_file=self.lib_file_1)
  File "/reco/reco/library.py", line 55, in __init__
    self.library_file = library_file
  File "/reco/reco/library.py", line 92, in library_file
    self.lib_df, self.sequence_length = self.read_library_file(
  File "/reco/reco/library.py", line 128, in read_library_file
    cleaned_library_df, sequence_length = check_sequence_lengths(
  File "/reco/reco/library.py", line 309, in check_sequence_lengths
    "shortest": min(length_counter.keys()),
ValueError: min() arg is an empty sequence

I'm including my library file as well in case it is formatted incorrectly. Also, is there a minimum length for sequences? When generating the library file, I kept only the uniquely identifying portions of the gRNAs.

unirecoLib1.csv

Please let me know if I can include any more information!

MaWeffm commented 1 year ago

Hi @cpfeifer58!

Regarding the minimum length of the sequences: there is currently no minimum length, but I take your question as an incentive to include an additional sanity check on the sequence lengths. Thank you!
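
For illustration, here is a minimal sketch of what such a sanity check might look like. It is not ReCo's actual code: the function name `summarize_sequence_lengths` and the `min_length` parameter are hypothetical. It counts sequence lengths and raises a descriptive error when nothing is left to count, instead of letting `min()` fail on an empty collection as in the traceback above.

```python
from collections import Counter


def summarize_sequence_lengths(sequences, min_length=1):
    """Summarize the lengths of the given sequences.

    Hypothetical helper for illustration only, not part of ReCo.
    Sequences shorter than `min_length` are ignored.
    """
    length_counter = Counter(
        len(seq) for seq in sequences if len(seq) >= min_length
    )

    # Guard the empty case explicitly: min() and max() raise
    # "ValueError: min() arg is an empty sequence" on empty input.
    if not length_counter:
        raise ValueError(
            "No valid sequences left after filtering; "
            "check the column order of the library file."
        )

    return {
        "shortest": min(length_counter),
        "longest": max(length_counter),
        "most_common": length_counter.most_common(1)[0][0],
    }


summary = summarize_sequence_lengths(["ACGT", "ACGTA", "ACGT"])
```

With a guard like this, a library in which every row is rejected would produce a message pointing at the file layout rather than a bare `min()` error.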

It seems your library file is not correctly set up. The first column should contain a unique gRNA identifier, and the second column should contain the actual gRNA sequence. Could you please try the file below and let me know if it works?

unirecoLib1.csv
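
In the warning output above, the `guide` column appears to hold the sequences and the `sequence` column the identifiers, which would explain why every row was flagged. A quick way to check a library file before running ReCo is sketched below. This is an assumption-laden example, not ReCo's own validation: the function name `check_library_rows` is hypothetical, the file is assumed to be comma-separated with a header row, and the allowed alphabet (A, C, G, T, N) is a guess.

```python
import csv
import io
import re

# Assumed alphabet for a gRNA sequence; adjust if your library differs.
DNA_PATTERN = re.compile(r"[ACGTN]+", re.IGNORECASE)


def check_library_rows(csv_text):
    """Split library rows into (valid, invalid) lists.

    Hypothetical helper, not part of ReCo. Expects a header row,
    the gRNA identifier in column 1 and the sequence in column 2.
    """
    reader = csv.reader(io.StringIO(csv_text))
    next(reader, None)  # skip the header row

    valid, invalid = [], []
    for row in reader:
        # A row is valid only if its second column looks like DNA.
        if len(row) >= 2 and DNA_PATTERN.fullmatch(row[1].strip()):
            valid.append(row)
        else:
            invalid.append(row)
    return valid, invalid


# Identifier first, sequence second: the layout described above.
correct = "guide,sequence\nABCC5_1,ACAGTGTAAAACCCTTAGAG\n"
# Columns swapped, as they appear in the warning output.
swapped = "guide,sequence\nACAGTGTAAAACCCTTAGAG,ABCC5_1\n"

valid_ok, invalid_ok = check_library_rows(correct)
valid_bad, invalid_bad = check_library_rows(swapped)
```

Here the correctly ordered row lands in the valid list, while the swapped row is rejected because `ABCC5_1` is not a DNA string.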

cpfeifer58 commented 1 year ago

Worked like a charm! Thanks!