Version issue Biopython when importing Bio.SeqUtils.GC

Boehmin commented 10 months ago

I just set up a fresh environment, and the installed version of Biopython was 1.83, where GC has been replaced by gc_fraction (introduced in 1.80). I received the error below and had to force-install Biopython 1.79, which resolved the issue. It did, however, tell me that matplotlib was not available (which might be unrelated?).

---------------------------------------------------------------------------
ImportError                               Traceback (most recent call last)
Cell In[1], line 4
      2 import Bio
      3 from Bio import SeqIO
----> 4 from Bio.SeqUtils import GC
      5 from Bio.Seq import Seq
      6 from Bio.SeqRecord import SeqRecord

ImportError: cannot import name 'GC' from 'Bio.SeqUtils' (/scicore/home/rueegg/boehm0002/miniconda3/envs/probedesign/lib/python3.9/site-packages/biopython-1.83-py3.9-linux-x86_64.egg/Bio/SeqUtils/__init__.py)
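A possible alternative to pinning Biopython 1.79 is a small compatibility import. This is only a sketch, not the project's fix: `gc_percent` is a hypothetical helper name that does not exist in PLP_directRNA_design, and the pure-Python fallback is an assumption added so the snippet runs even without Biopython installed.

```python
# Compatibility shim; `gc_percent` is a hypothetical helper name.
try:
    # Newer Biopython: gc_fraction returns a value between 0 and 1.
    from Bio.SeqUtils import gc_fraction

    def gc_percent(seq):
        """GC content as a percentage, like the old GC function."""
        return 100.0 * gc_fraction(seq)
except ImportError:
    try:
        # Older Biopython: GC already returns a percentage.
        from Bio.SeqUtils import GC as gc_percent
    except ImportError:
        # Assumption: plain fallback for when Biopython is not installed.
        def gc_percent(seq):
            seq = str(seq).upper()
            if not seq:
                return 0.0
            return 100.0 * sum(base in "GC" for base in seq) / len(seq)

print(gc_percent("ATGC"))
```

Any code that compares the result against a percentage threshold would then call `gc_percent` instead of `GC`, since `gc_fraction` on its own returns a fraction rather than a percentage.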

mgcizzu commented 10 months ago

Thanks Ines, we'll try to fix the biopython thing in the next update. Can you please post your matplotlib error? Thx! Marco

Boehmin commented 10 months ago

Hi Marco,

the matplotlib error was ModuleNotFoundError: No module named 'matplotlib'

I just installed matplotlib and it worked fine.

I have now managed to run through the whole notebook (I can post these issues separately if preferred, or change the title of this issue, but here they are summed up for now). I had the same pandas error as here. I freshly pulled from git yesterday. First I tried changing .append to pd.concat in probedesign.py, which did not work. I then forced pandas==1.1.5, which fixed it, as suggested in the linked issue.
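For reference, a minimal sketch of the usual .append to pd.concat replacement, assuming the error came from `DataFrame.append` having been removed in pandas 2.0. The frame and its values are invented for illustration and are not taken from probedesign.py.

```python
import pandas as pd

# Toy frame; the values are made up for illustration.
genes = pd.DataFrame({"gene": ["A"], "Lbar_ID": ["1"]})
new_row = {"gene": "B", "Lbar_ID": "2"}

# Old style (removed in pandas 2.0):
#   genes = genes.append(new_row, ignore_index=True)
# With pd.concat, the dict is first wrapped in a one-row DataFrame.
genes = pd.concat([genes, pd.DataFrame([new_row])], ignore_index=True)
print(genes)
```

Inside a loop, collecting the row dicts in a list and building the DataFrame once at the end is usually faster than concatenating on every iteration.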

Since I still had to input the cutadapt file path, it took me a while to realise that the notebook was loading probedesign.py from the installed .egg, so editing the probedesign.py file in the repository had no effect. I had not worked with modules/.egg files before, so it was quicker for me to replace the line from PLP_directRNA_design import probedesign as plp with the following:

import sys
# Add the folder path to sys.path
sys.path.insert(0, '/user/pathto/PLP_directRNA_design')
import probedesign as plp

Not sure if there is an easier way (or if this might have broken things?).
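One way to check which copy of a module Python would pick up is to ask the import system directly. This is a small sketch using only the standard library; `module_origin` is a hypothetical helper name.

```python
import importlib.util

def module_origin(name):
    """Return the file a top-level module would be imported from, or None."""
    spec = importlib.util.find_spec(name)
    return None if spec is None else spec.origin

# Package name as used in the notebook import; prints None if it is not installed.
print(module_origin("PLP_directRNA_design"))
```

If the printed path points into site-packages (for example an .egg), edits to the cloned repository will not be picked up. Assuming the repository ships a standard setup file, an editable install (`pip install -e .` from the repository root) might be an alternative to editing sys.path.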

Another issue I had was in the "Assign Genes to Barcode" section, where I tested how="start" with on=LbarID. Since it wasn't entirely clear to me whether to use LbarID or a different variable/value, I tried all variations of "LbarID"/LbarID, numbers, column IDs, even barcode ID, etc. (maybe a full example could help?), until I figured out that the issue was in Lprobe_Ver2.csv. For clarity, this is the error I received:

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
File ~/miniconda3/envs/probedesign/lib/python3.9/site-packages/pandas/core/indexes/base.py:3361, in Index.get_loc(self, key, method, tolerance)
   3360 try:
-> 3361     return self._engine.get_loc(casted_key)
   3362 except KeyError as err:

File ~/miniconda3/envs/probedesign/lib/python3.9/site-packages/pandas/_libs/index.pyx:76, in pandas._libs.index.IndexEngine.get_loc()

File ~/miniconda3/envs/probedesign/lib/python3.9/site-packages/pandas/_libs/index.pyx:108, in pandas._libs.index.IndexEngine.get_loc()

File pandas/_libs/hashtable_class_helper.pxi:5198, in pandas._libs.hashtable.PyObjectHashTable.get_item()

File pandas/_libs/hashtable_class_helper.pxi:5206, in pandas._libs.hashtable.PyObjectHashTable.get_item()

KeyError: 'number'

The above exception was the direct cause of the following exception:

KeyError                                  Traceback (most recent call last)
Cell In[21], line 2
      1 customizedlib=r"C:\Users\sergio.salas\Documents\PhD\projects\gene_design\hower_example_5\assigned_gene_LID.csv"
----> 2 probes=plp.build_plps(path,specific_seqs_final,L_probe_library,plp_length,how='start',on="Lbar_ID")

File ~/02-spatial_transcriptomics/01-dataanalysis/08-ISS_processing/PLP_directRNA_design/PLP_directRNA_design/probedesign.py:510, in build_plps(path, specific_seqs_final, L_probe_library, plp_length, how, on)
    508 n=0
    509 for g in gname:
--> 510     gene_names_ID = gene_names_ID.append({"gene": g, "idseq" : np.array(sbh.loc[sbh['number']==ID+n,'ID_Seq'])[0], "Lbar_ID" : str(np.array(sbh.loc[sbh['number']==ID+n,'Lbar_ID'])[0]), "AffyID" : np.array(sbh.loc[sbh['number']==ID+n,'L_Affy_ID'])[0] }, ignore_index=True)
    511     n=n+1
    512 gene_names_ID2=gene_names_ID.set_index("gene", drop = False)

File ~/miniconda3/envs/probedesign/lib/python3.9/site-packages/pandas/core/frame.py:3458, in DataFrame.__getitem__(self, key)
   3456 if self.columns.nlevels > 1:
   3457     return self._getitem_multilevel(key)
-> 3458 indexer = self.columns.get_loc(key)
   3459 if is_integer(indexer):
   3460     indexer = [indexer]

File ~/miniconda3/envs/probedesign/lib/python3.9/site-packages/pandas/core/indexes/base.py:3363, in Index.get_loc(self, key, method, tolerance)
   3361         return self._engine.get_loc(casted_key)
   3362     except KeyError as err:
-> 3363         raise KeyError(key) from err
   3365 if is_scalar(key) and isna(key) and not self.hasnans:
   3366     raise KeyError(key)

KeyError: 'number'

I changed the values of the Lbar_ID rows ("LbarID_0") in the .csv to numbers (201>) and changed sbh['number'] to sbh['Lbar_ID'], as shown below:

for g in gname:
    gene_names_ID = gene_names_ID.append({"gene": g, "idseq" : np.array(sbh.loc[sbh['Lbar_ID']==ID+n,'ID_Seq'])[0], "Lbar_ID" : str(np.array(sbh.loc[sbh['Lbar_ID']==ID+n,'Lbar_ID'])[0]), "AffyID" : np.array(sbh.loc[sbh['Lbar_ID']==ID+n,'L_Affy_ID'])[0] }, ignore_index=True)

Now it runs.
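A quick check like the following could catch this kind of KeyError before calling build_plps. It is only a sketch: the column names are the ones build_plps reads in the traceback above, while `missing_columns` and the toy table are hypothetical.

```python
import pandas as pd

# Columns that build_plps reads from the L-probe table, per the traceback above.
REQUIRED = ["number", "ID_Seq", "Lbar_ID", "L_Affy_ID"]

def missing_columns(table, required=REQUIRED):
    """Return the required column names that are absent from `table`."""
    return [col for col in required if col not in table.columns]

# Toy table standing in for the L-probe CSV; the values are invented.
toy = pd.DataFrame({"ID_Seq": ["ACGT"], "Lbar_ID": ["LbarID_0"], "L_Affy_ID": ["x"]})
print(missing_columns(toy))
```

Running the same check on the DataFrame loaded from the real CSV would show directly whether the 'number' column is present.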

Additionally, I was wondering if you have ever had issues with duplicate sequences? I tried this pipeline with a random gene (Zfp85) and got 6 PLP sequences, 2 of which are duplicates. The results are attached: good_targetsfinal.csv designed_PLPs_final.csv

Thank you! Ines

mgcizzu commented 10 months ago

Hi Ines, interesting, the pandas thing should have been fixed by a previous update. I'll double check. Same goes for the redundant probes: we had solved this in a previous version of our code, but it somehow made it in here. Give me a few days to go through the code. M.

mgcizzu commented 10 months ago

OK, here I am. This is for both @Boehmin and @Sverreg (commenting on the issue mentioned here). You get duplicate sequences (or sequences within a ±20 nt range, which overlap and should be excluded; you can check this in the position column) likely because your gene is a bad substrate for the probes: either it's too short or it doesn't comply very well with our GC requirement. When the search doesn't find a number of targets equal to or greater than the one you specified (default=5), it automatically returns all the targets. Here's the relevant code snippet from the select_sequences function in PLP_design.py:

if ele.shape[0]<number_of_selected:
    selec=ele
else:
    for num in range(0,number_of_selected):
        if ele.shape[0]>0:
            randomlist = random.sample(range(0, ele.shape[0]), 1)
            sele=ele.iloc[randomlist,:]
            try:
                seleall=pd.concat([seleall,sele])
            except:
                seleall=sele
            exclude=list(range(int(sele['Position']-20),int(sele['Position']+20)))
            ele=ele.loc[~ele['Position'].isin(exclude),:]
    selec=seleall
selected2=pd.concat([selected2,selec])

I'd suggest running the search again for these genes, relaxing the GC content a bit or removing the requirement for a terminal G/C. Maybe that will fix the issue. Keep in mind that sometimes it's impossible to design "good" probes against some genes. You can try your chances anyway and design them manually, and they might work... Please let me know if my explanation doesn't make sense.

Cheers, Marco

Boehmin commented 10 months ago

Hi Marco,

thank you for the explanation! I'll keep that in mind and will give this a go. Just to check, would lowering the target requirement to 3 or 4 potentially also help?

Cheers, Ines

mgcizzu commented 9 months ago

Hi Ines, regarding your last question: I think it's wise to have 5 probes (targets) per gene if possible. This will ensure high detection efficiency. While setting the target requirement to 3 or 4 will help solve the issue above, I'd still try to design 5 and do a manual check to remove duplicates and overlaps. Cheers and sorry for the late reply! Marco
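The manual check for duplicates and overlaps could be sketched roughly as follows. This is not part of the pipeline: `drop_duplicates_and_overlaps` is a hypothetical helper, the "Sequence" column name is an assumption about the output CSV, and the 20 nt spacing mirrors the exclusion window in the select_sequences snippet quoted earlier in the thread.

```python
import pandas as pd

def drop_duplicates_and_overlaps(probes, seq_col="Sequence", pos_col="Position", min_gap=20):
    """Keep one row per sequence, then greedily drop rows closer than min_gap nt."""
    unique = probes.drop_duplicates(subset=seq_col).sort_values(pos_col)
    kept_rows, last_pos = [], None
    for idx, pos in unique[pos_col].items():
        # Keep a target only if it starts at least min_gap nt after the last kept one.
        if last_pos is None or pos - last_pos >= min_gap:
            kept_rows.append(idx)
            last_pos = pos
    return unique.loc[kept_rows]

# Toy data; sequences and positions are invented.
toy = pd.DataFrame({
    "Sequence": ["AAA", "AAA", "CCC", "GGG"],
    "Position": [100, 100, 110, 130],
})
filtered = drop_duplicates_and_overlaps(toy)
print(filtered)
```

The greedy pass keeps the leftmost target in each cluster; a different tie-break (for example, preferring targets with better GC content) may suit some genes better.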

mgcizzu commented 9 months ago

I also just fixed the L-probe example file in a way that should now work consistently, and changed the pandas requirements again (turns out I had changed that in a test branch and not merged the changes into the main branch). Please keep finding bugs! Marco

mgcizzu commented 9 months ago

I haven't changed the Biopython requirements, as you really get a warning rather than an error. So I'll try to fix it with a bit more time in a later release :)

Boehmin commented 9 months ago

Hi Marco,

thanks for the tip. I'll test this again. Maybe you can answer another question. I am trying to design probes that are as species-agnostic as possible between mouse and human (following the description in the pre-print). Do I understand correctly that I should:

1. Run the extract and align sequences for mouse & human
2. Concatenate results?
3. Run plp.select_sequences() on mouse + human extracted 30kmers (run this on mouse & human separately or combined?)
4. Run plp.map_sequences() on selected sequences above, keep only 30mers found in both species, run this twice; once against human, once against mouse
5. Continue normally to end like in the tutorial notebook
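To make the "keep only 30mers found in both species" idea concrete, here is a toy sketch of finding k-mers shared by two sequences. It only illustrates the concept with exact string matching; it is not how plp.map_sequences() works and does not confirm the workflow above. Both function names are hypothetical.

```python
def kmers(seq, k=30):
    """Return the set of all k-length substrings of seq (upper-cased)."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def shared_kmers(seq_a, seq_b, k=30):
    """Return k-mers that occur exactly in both sequences."""
    return kmers(seq_a, k) & kmers(seq_b, k)

# Tiny invented sequences with k=3 so the result is easy to inspect.
print(sorted(shared_kmers("ACGTA", "CGTAC", k=3)))
```

Exact matching is the strictest possible criterion; whether mismatches between orthologous targets are tolerable is a separate question for the probe design.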

I'm also slowly going through the ISS analysis notebooks, so I will start a separate issue should I run into bugs there. :)

Boehmin commented 9 months ago

Hi @mgcizzu ,

one more question, is the anchor sequence in the notebook the primer sequence? Since you use a pseudo-anchor as described in the supplementary, I assume the anchor in this notebook is the complementary sequence of your RCA primer? Thank you!
