Count generation script (process_new.py) index error

When running count generation on a set of FASTQ files, the script fails after processing a subset of the files. Re-running it, without changing the generated files or the config file, lets it finish successfully. The error message:

    2022-08-05 16:00:54,592 INFO: found 2558958/10392813 tags (24.62%)
    /home/ubuntu/s3-drive/code/analysis_new.py:312: UserWarning: Boolean Series key will be reindexed to match DataFrame index.
      lib_subset = totals[self.results['lib_id'] == lib_id]
    Traceback (most recent call last):
      File "/home/ubuntu/s3-drive/code/process_new.py", line 42, in <module>
        main(config)
      File "/home/ubuntu/s3-drive/code/process_new.py", line 14, in main
        analysis.run_counts()
      File "/home/ubuntu/s3-drive/code/analysis_new.py", line 312, in run_counts
        lib_subset = totals[self.results['lib_id'] == lib_id]
      File "/home/ubuntu/.local/lib/python3.10/site-packages/pandas/core/frame.py", line 3496, in __getitem__
        return self._getitem_bool_array(key)
      File "/home/ubuntu/.local/lib/python3.10/site-packages/pandas/core/frame.py", line 3549, in _getitem_bool_array
        key = check_bool_indexer(self.index, key)
      File "/home/ubuntu/.local/lib/python3.10/site-packages/pandas/core/indexing.py", line 2383, in check_bool_indexer
        raise IndexingError(
    pandas.core.indexing.IndexingError: Unalignable boolean Series provided as indexer (index of the boolean Series and of the indexed object do not match).

The offending line seems to be:

    lib_subset = totals[self.results['lib_id'] == lib_id]

This implies that the index of the DataFrame self.results does not match the index of totals, for example because the two differ in length. I suspect that one of the output files is either:

1) not created yet but expected to exist
2) created but missing data that is needed
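A quick way to test that suspicion is to compare the two indexes just before the failing line. The snippet below is only a sketch: describe_alignment is a hypothetical helper, and the two small DataFrames are made-up stand-ins for totals and self.results, not data from the real pipeline.

```python
import pandas as pd


def describe_alignment(frame, mask_source):
    """Summarize whether a boolean mask built from mask_source can index frame."""
    # Labels that frame has but mask_source lacks; these become NaN when
    # pandas reindexes the mask, which is what triggers the IndexingError.
    missing = frame.index.difference(mask_source.index)
    return {
        "len_frame": len(frame),
        "len_mask_source": len(mask_source),
        "indexes_equal": frame.index.equals(mask_source.index),
        "labels_missing_from_mask": missing.tolist(),
    }


# Made-up stand-ins: same length, but different index labels.
totals = pd.DataFrame({"count": [5, 7, 9]})
results = pd.DataFrame({"lib_id": [1, 1, 2]}, index=[10, 11, 12])

report = describe_alignment(totals, results)
print(report)
```

Note that the two frames here have the same length and still fail the check, so a length comparison alone would not catch the problem.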

Further debugging is warranted. Even though the bug is fairly harmless, restarting the script is clumsy and clutters the log file.

This issue is resolved. Inside analysis_new.py in the Connor code:

The file with library info/compounds is truncated to only the libraries of interest on line 31:

    self.df_cpds = self.df_cpds[self.df_cpds['lib_id'].isin(library_ids)]

This shortens the table, but the DataFrame's index retains its old numbering, even when the table is assigned to the self.results DataFrame on line 344:

    self.results = self.library.df_cpds.copy()

Thus, line 312:

    lib_subset = totals[self.results['lib_id'] == lib_id]

fails due to an index mismatch between totals (index starts at 0) and self.results (index starts much higher). However, if you re-run the script, self.results loads the truncated library information directly from save_file.csv, so its index starts at 0 and the code runs without error.

Therefore, the index of self.results should be reset. Changing line 344 to:

    self.results = self.library.df_cpds.copy().reset_index(drop=True)

allows the script to run correctly the first time. I tested this on a MiSeq dataset because of its small size and fast run time.
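For reference, the behavior can be reproduced outside the pipeline with a few lines of pandas. This is a minimal sketch under stated assumptions: the toy tables, the subset helper, and the in-memory CSV buffer are all hypothetical stand-ins for df_cpds, totals, line 312, and save_file.csv, and it assumes the saved file is read back with a default integer index.

```python
import io
import warnings

import pandas as pd

try:
    from pandas.errors import IndexingError  # newer pandas releases
except ImportError:
    from pandas.core.indexing import IndexingError  # older pandas releases

# Toy compound table, filtered the same way as on line 31.
df_cpds = pd.DataFrame({"lib_id": [1, 1, 2, 2, 3, 3], "cpd": list("abcdef")})
library_ids = [2, 3]
df_cpds = df_cpds[df_cpds["lib_id"].isin(library_ids)]  # index is now 2..5

# Toy totals table: one row per remaining compound, fresh index 0..3.
totals = pd.DataFrame({"count": [10, 20, 30, 40]})


def subset(results, lib_id):
    """Mimic line 312; the UserWarning seen in the log is silenced here."""
    with warnings.catch_warnings():
        warnings.simplefilter("ignore", UserWarning)
        return totals[results["lib_id"] == lib_id]


# First run: results keeps the stale index 2..5, so the mask cannot align.
results = df_cpds.copy()
try:
    subset(results, 2)
    first_run_failed = False
except IndexingError:
    first_run_failed = True

# Re-run: a CSV round trip yields a fresh 0..3 index, so the mask aligns.
buf = io.StringIO()
results.to_csv(buf, index=False)
buf.seek(0)
reloaded = pd.read_csv(buf)
rerun_subset = subset(reloaded, 2)

# Fix: reset the index at assignment time, as in the changed line 344.
fixed = df_cpds.copy().reset_index(drop=True)
fixed_subset = subset(fixed, 3)

print(first_run_failed)
print(rerun_subset)
print(fixed_subset)
```

The fix relies on totals and self.results having rows in the same order, since reset_index(drop=True) aligns them by position only.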

broadinstitute / chem-bio-dos-del

Count generation script (process_new.py) index error #14